
Department of Information Technology

LAB MANUAL
R20

DATA MINING LAB


[III B.TECH, II-SEM]

KKR & KSR INSTITUTE OF TECHNOLOGY AND SCIENCES
Vinjanampadu, Guntur District- 522017 (A. P.)

Document No: KITS/IT/LAB MANUAL/DM
Date of issue:                 Date of revision:
Compiled by:        Verified by:        Authorized by:

INDEX
S.No  Contents                                            Page No.
1     Institute Vision & Mission                          3
2     Department Vision & Mission                         3
3     Program Educational Objectives & Program Outcomes   4-5
4     Program Specific Outcomes                           6
5     Syllabus                                            7
6     Course Outcomes                                     8
7     List of Experiments                                 9
8     Course Outcomes of associated course                10
9     Experiment Mapping with Course Outcomes             10
Experiments
1 Implement all basic R commands
2 Interact data through .csv files (Import from and export to
.csv files).
3 Get and Clean data using swirl exercises. (Use ‘swirl’ package, library
and install that topic from swirl).
4 Visualize all Statistical measures (Mean, Mode, Median, Range,
Inter Quartile Range etc., using Histograms, Boxplots and Scatter Plots).
5 Create a data frame with the following structure.

EMP ID  EMP NAME  SALARY  START DATE
1       Satish    5000    01-11-2013
2       Vani      7500    05-06-2011
3       Ramesh    10000   21-09-1999
4       Praveen   9500    13-09-2005
5       Pallavi   4500    23-10-2000
a. Extract two column names using column name.
b. Extract the first two rows and then all columns.
c. Extract 3rd and 5th row with 2nd and 4th column
6 Write R Program using ‘apply’ group of functions to create
and apply normalization function on each of the numeric
variables/columns of iris dataset to transform them into
i. 0 to 1 range with min-max normalization.
ii. a value around 0 with z-score normalization
7 Create a data frame with 10 observations and 3 variables and
add new rows and columns to it using ‘rbind’ and ‘cbind’
function.
8 Write R program to implement linear and multiple regression
on ‘mtcars’ dataset to estimate the value of ‘mpg’ variable,
with best R2 and plot the original values in ‘green’ and
predicted values in ‘red’.
9 Implement k-means clustering using R.
10 Implement k-medoids clustering using R.
11 Implement density based clustering on iris dataset.
12 Implement decision trees using ‘readingSkills’ dataset.
13 Implement decision trees using ‘iris’ dataset using package
party and ‘rpart’.
14 Use a Corpus() function to create a data corpus then Build
a term Matrix and Reveal word frequencies.
ADDITIONAL EXPERIMENTS
15
16
17
KKR & KSR INSTITUTE OF TECHNOLOGY AND SCIENCES
(Approved by AICTE, New Delhi, Affiliated to JNTU Kakinada, Approved by NBA & NAAC)

DEPARTMENT OF Information Technology


INSTITUTE VISION
To become a knowledge centre for technical education and also to become the top
engineering college in the sunrise state of Andhra Pradesh.

INSTITUTE MISSION
1. To incorporate benchmarked teaching and learning pedagogies in curriculum.
2. To ensure all round development of students through judicious blend of curricular, co-
curricular and extra-curricular activities.
3. To support cross-cultural exchange of knowledge between industry and academy.
4. To provide higher/continued education and research opportunities to the employees of
the institution.

DEPARTMENT VISION
To commit itself to continuously improve its educational environment in order to develop
graduates with the strong academic and technical backgrounds needed to achieve distinction
and discipline.

DEPARTMENT MISSION
To provide a strong theoretical and practical education in a congenial environment so as to
enable the students to fulfill their educational and industrial needs.
PROGRAM EDUCATIONAL OBJECTIVES OF IT DEPARTMENT

PEO 1:

Domain Knowledge: Have a strong foundation in areas like mathematics, science and engineering fundamentals so
as to enable them to solve and analyze engineering problems and to prepare them for careers, R&D and
higher-level studies.

PEO 2:
Professional Employment: Have an ability to analyze and understand the requirements of software, technical
specifications required and provide novel engineering solutions to the problems associated with hardware and
software.

PEO 3:

Higher Degrees: Have exposure to cutting-edge technologies, thereby enabling them to achieve excellence in the areas
of their studies.

PEO 4:

Engineering Citizenship: Work in teams on multi-disciplinary projects with effective communication skills and
leadership qualities.

PEO 5:
Lifelong Learning: Have a successful career wherein they strike a balance between ethical values and commercial
values.
PROGRAM OUTCOMES (PO’S)

1. Engineering knowledge:

Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the
solution of complex engineering problems.

2. Problem analysis:
Identify, formulate, research literature, and analyze complex engineering problems reaching substantiated
conclusions using first principles of mathematics, natural sciences, and engineering sciences.

3. Design/development of solutions:

Design solutions for complex engineering problems and design system components or processes that meet the
specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and
environmental considerations.

4. Conduct investigations of complex problems:

Use research-based knowledge and research methods including design of experiments, analysis and interpretation
of data, and synthesis of the information to provide valid conclusions.

5. Modern tool usage:


Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including
prediction and modeling to complex engineering activities with an understanding of the limitations.

6. The engineer and society:

Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues
and the consequent responsibilities relevant to the professional engineering practice.

7. Environment and sustainability:


Understand the impact of the professional engineering solutions in societal and environmental contexts, and
demonstrate the knowledge of, and need for sustainable development.

8. Ethics:

Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering
practice.

9. Individual and team work:

Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.

10. Communication:

Communicate effectively on complex engineering activities with the engineering community and with society at
large, such as, being able to comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.

11. Project management and finance:

Demonstrate knowledge and understanding of the engineering and management principles and apply these to one’s
own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.

12. Life-long learning:


Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the
broadest context of technological change.

PROGRAM SPECIFIC OUTCOME (PSO’S)

PSO1: Application Development

Able to develop business solutions using the latest software techniques and tools for real-time applications.

PSO2: Professional and Leadership

Able to practice the profession with ethical leadership as an entrepreneur through participation in various events
like Ideathon, Hackathon, project expos and workshops.

PSO3: Computing Paradigms

Ability to identify the evolutionary changes in computing using Data Sciences, Apps, Cloud computing and IoT.
III Year – II Semester    L T P C: 0 0 3 1.5
DATA MINING LAB
Course Objectives:
• To understand the mathematical basics quickly and cover every aspect of data mining needed to
prepare for real-world problems
• To cover the various classes of algorithms and give a foundation for applying that knowledge to
dive deeper into the different flavors of algorithms
• To make students aware of the packages and libraries of R and familiar with the functions used in
R for visualization
• To enable students to use R to conduct analytics on large real-life datasets
• To familiarize students with how various statistics such as the mean and median are computed and
how data can be collected for data exploration in R
List of Experiments:
1. Implement all basic R commands.
2. Interact data through .csv files (Import from and export to .csv files).
3. Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that topic
from swirl).
4. Visualize all Statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc.,
using Histograms, Boxplots and Scatter Plots).
5. Create a data frame with the following structure.

EMP ID EMP NAME SALARY START DATE


1 Satish 5000 01-11-2013
2 Vani 7500 05-06-2011
3 Ramesh 10000 21-09-1999
4 Praveen 9500 13-09-2005
5 Pallavi 4500 23-10-2000

a. Extract two column names using column name.


b. Extract the first two rows and then all columns.
c. Extract 3rd and 5th row with 2nd and 4th column.
6. Write R Program using ‘apply’ group of functions to create and apply normalization function on
each of the numeric variables/columns of iris dataset to transform them into
i. 0 to 1 range with min-max normalization.
ii. a value around 0 with z-score normalization.

7. Create a data frame with 10 observations and 3 variables and add new rows and columns to it
using ‘rbind’ and ‘cbind’ function.
8. Write R program to implement linear and multiple regression on ‘mtcars’ dataset to estimate the
value of ‘mpg’ variable, with best R2 and plot the original values in ‘green’ and predicted values in
‘red’.
9. Implement k-means clustering using R.
10. Implement k-medoids clustering using R.
11. Implement density based clustering on iris dataset.
12. Implement decision trees using ‘readingSkills’ dataset.
13. Implement decision trees using ‘iris’ dataset using package party and ‘rpart’.
14. Use a Corpus() function to create a data corpus then Build a term Matrix and Reveal word
Frequencies.

Course Outcomes: At the end of the course, the student will be able to


CO327.1 Extend the functionality of R by using add-on packages
CO327.2 Examine data from files and other sources and perform various data manipulation tasks on
them
CO327.3 Code statistical functions in R
CO327.4 Use R Graphics and Tables to visualize results of various statistical operations on data
CO327.5 Apply the knowledge of R gained to data Analytics for real life applications

COURSE OUTCOMES OF ASSOCIATED COURSE (DATA MINING)


At the end of the course, the students will be able to:
CO322.1 Apply suitable pre-processing and visualization techniques for data analysis
CO322.2 Apply frequent pattern and association rule mining techniques for data analysis
CO322.3 Apply appropriate classification techniques for data analysis
CO322.4 Apply appropriate clustering techniques for data analysis

Associated Theory Course:

III Year – II Semester    L T P C: 4 2 0 3
DATA MINING

Course Objectives:

• To understand data warehouse concepts, architecture, business analysis and tools
• To understand data pre-processing and data visualization techniques
• To study algorithms for finding hidden and interesting patterns in data
• To understand and apply various classification and clustering techniques using tools

UNIT I

Data Warehousing, Business Analysis and On-Line Analytical Processing (OLAP): Basic Concepts,
Data Warehousing Components, Building a Data Warehouse, Database Architectures for Parallel
Processing, Parallel DBMS Vendors, Multidimensional Data Model, Data Warehouse Schemas for
Decision Support, Concept Hierarchies, Characteristics of OLAP Systems, Typical OLAP
Operations, OLAP and OLTP.

UNIT II

Data Mining – Introduction: Introduction to Data Mining Systems, Knowledge Discovery Process,
Data Mining Techniques, Issues, applications, Data Objects and attribute types, Statistical
description of data, Data Preprocessing – Cleaning, Integration, Reduction, Transformation and
discretization, Data Visualization, Data similarity and dissimilarity measures.

UNIT III
Data Mining - Frequent Pattern Analysis: Mining Frequent Patterns, Associations and Correlations,
Mining Methods, Pattern Evaluation Method, Pattern Mining in Multilevel, Multi-Dimensional
Space – Constraint Based Frequent Pattern Mining, Classification using Frequent Patterns

UNIT IV

Classification: Decision Tree Induction, Bayesian Classification, Rule Based Classification,


Classification by Back Propagation, Support Vector Machines, Lazy Learners, Model Evaluation
and Selection, Techniques to improve Classification Accuracy

UNIT V

Clustering: Clustering Techniques, Cluster analysis, Partitioning Methods, Hierarchical methods,


Density Based Methods, Grid Based Methods, Evaluation of clustering, Clustering high
dimensional data, Clustering with constraints, Outlier analysis, outlier detection methods.

Text Books:

1) Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, Third Edition,
Elsevier, 2012.
2) Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining,
Pearson,2016.
DEPARTMENT OF INFORMATION TECHNOLOGY

List of Experiments

S.No Contents
1 Implement all basic R commands
2 Interact data through .csv files (Import from and export to .csv files).
3 Get and Clean data using swirl exercises. (Use ‘swirl’ package, library
and install that topic from swirl).
4 Visualize all Statistical measures (Mean, Mode, Median, Range, Inter
Quartile Range etc., using Histograms, Boxplots and Scatter Plots).
5 Create a data frame with the following structure.

EMP ID EMP NAME SALARY START DATE


1 Satish 5000 01-11-2013
2 Vani 7500 05-06-2011
3 Ramesh 10000 21-09-1999
4 Praveen 9500 13-09-2005
5 Pallavi 4500 23-10-2000
a. Extract two column names using column name.
b. Extract the first two rows and then all columns.
c. Extract 3rd and 5th row with 2nd and 4th column
6 Write R Program using ‘apply’ group of functions to create and apply
normalization function on each of the numeric variables/columns of iris
dataset to transform them into
i. 0 to 1 range with min-max normalization.
ii. a value around 0 with z-score normalization
7 Create a data frame with 10 observations and 3 variables and add new rows
and columns to it using ‘rbind’ and ‘cbind’ function.
8 Write R program to implement linear and multiple regression on ‘mtcars’
dataset to estimate the value of ‘mpg’ variable, with best R2 and plot the
original values in ‘green’ and predicted values in ‘red’.
9 Implement k-means clustering using R.
10 Implement k-medoids clustering using R.
11 Implement density based clustering on iris dataset.
12 Implement decision trees using ‘readingSkills’ dataset.
13 Implement decision trees using ‘iris’ dataset using package party and
‘rpart’.
14 Use a Corpus() function to create a data corpus then Build a term Matrix
and Reveal word frequencies.
15
16
17

Mapping of COs with Lab Experiments:

EXPERIMENT C312.1 C312.2 C312.3 C312.4 C312.5

EX1 3

EX2 3

EX3 3

EX4 3

EX5 3

EX6 3

EX7 3

EX8 3

EX9 3

EX10 3

EX11 3

EX12 3

EX13 3

EX14 3

Level of Mapping: 1 – Slightly 2 – Moderate 3 – Highly

1. Implement all basic R commands.

Type 'demo()' for some demos, 'help()' for on-line help, or


'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
BASIC MATH:
> 1+2+3
[1] 6
> 3*7*2
[1] 42
> 4/3
[1] 1.333333
> (4*6)+5
[1] 29
> 4*(6+5)
[1] 44

VARIABLES:
> x<-2
>x
[1] 2
> y=5
>y
[1] 5
> 3<-z
Error in 3 <- z : invalid (do_set) left-hand side to assignment
>z
Error: object 'z' not found
> 3-> z
>z
[1] 3
> a<-b<-7
>a
[1] 7
>b
[1] 7
> assign("j",4)
>j
[1] 4
Removing Variable:
> rm(j)
>j
Error: object 'j' not found
> xyz<-5
> xyz
[1] 5
> XYZ
Error: object 'XYZ' not found
DATA TYPES:
> class(x)
[1] "numeric"
> is.numeric(x)
[1] TRUE
> i<-4L
>i
[1] 4
> is.integer(i)
[1] TRUE
> class(4L)
[1] "integer"
> class(2.8)
[1] "numeric"
> 4L*2.8
[1] 11.2
> 5L/2L
[1] 2.5
> class(5L/2L)
[1] "numeric"
> TRUE*5
[1] 5
> FALSE*5
[1] 0
Character Data:
>x<-data()
>x
> x<- "data"
>x
[1] "data"
> y<-factor("data")
>y
[1] data
Levels: data
> nchar(x)
[1] 4
> nchar("hello")
[1] 5
> nchar(3)
[1] 1
> nchar(452)
[1] 3
> nchar(y)
Error in nchar(y) : 'nchar()' requires a character vector
# Will not work for factor.
DATES:
> date1<-as.Date("2021-09-20")
> date1
[1] "2021-09-20"
> class(date1)
[1] "Date"
> as.numeric(date1)
[1] 18890
> date2<-as.POSIXct("2021-09-20")
> date2
[1] "2021-09-20 IST"
> class(date2)
[1] "POSIXct" "POSIXt"
LOGICAL:
> k<-TRUE
> class(k)
[1] "logical"
> 2==3
[1] FALSE
> #comments
> 2!=3
[1] TRUE
> 2<3
[1] TRUE
> 2>3
[1] FALSE
> "data"<"stats"
[1] TRUE
VECTORS:
> c(1,2,3,4)
[1] 1 2 3 4
>c
function (...) .Primitive("c")
> c("c","R","Python")
[1] "c" "R" "Python"
> x<-c(1,2,3,4)
>x
[1] 1 2 3 4
> x<-c(1,2,3,s)
Error: object 's' not found
> x<-c(1,2,3,3)
>x
[1] 1 2 3 3
> x+2
[1] 3 4 5 5
> x*2
[1] 2 4 6 6
> x/2
[1] 0.5 1.0 1.5 1.5
> sqrt(x)
[1] 1.000000 1.414214 1.732051 1.732051
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
> -2:5
[1] -2 -1 0 1 2 3 4 5
> 5:-9
[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9
> x<-1:10
>x
[1] 1 2 3 4 5 6 7 8 9 10
> y<- -5:4
>y
[1] -5 -4 -3 -2 -1 0 1 2 3 4
> x+y
[1] -4 -2 0 2 4 6 8 10 12 14
> x-y
[1] 6 6 6 6 6 6 6 6 6 6
> z=x-y
>z
[1] 6 6 6 6 6 6 6 6 6 6
> x/2
[1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> x/y
[1] -0.2 -0.5 -1.0 -2.0 -5.0 Inf 7.0 4.0 3.0 2.5
> x^2
[1] 1 4 9 16 25 36 49 64 81 100
> length(x)
[1] 10
> length(x+y)
[1] 10
>x
[1] 1 2 3 4 5 6 7 8 9 10
> x+c(1,2)
[1] 2 4 4 6 6 8 8 10 10 12
> x<=5
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
> x<y
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> x<- 10:1
> y<- -4:5
>x
[1] 10 9 8 7 6 5 4 3 2 1
>y
[1] -4 -3 -2 -1 0 1 2 3 4 5
> any(x<y)
[1] TRUE
> all(x<y)
[1] FALSE
> q<-
c("hockey","football","baseball","curling","rugby","lacrosse","basketball","tennis","cricket","soccer")
> nchar(q)
[1] 6 8 8 7 5 8 10 6 7 6
> nchar(y)
[1] 2 2 2 2 1 1 1 1 1 1
>x
[1] 10 9 8 7 6 5 4 3 2 1
> x[1]
[1] 10
> x[1:2]
[1] 10 9
> x{c(1,5)}
Error: unexpected '{' in "x{"
> x[c(1,5)]
[1] 10 6
> c(one="a",two="b",three="c")
one two three
"a" "b" "c"
> w<-1:3
> names(w)
NULL
> names(w)<-c("a","b","c")
>w
a b c
1 2 3
CALLING A FUNCTION:
>x
[1] 10 9 8 7 6 5 4 3 2 1
> mean(x)
[1] 5.5
> mode(x)
[1] "numeric"
> median(x)
[1] 5.5

FUNCTION DOCUMENTATION:
> apropos("mea")
[1] ".colMeans" ".rowMeans" "colMeans"
[4] "influence.measures" "kmeans" "mean"
[7] "mean.Date" "mean.default" "mean.difftime"
[10] "mean.POSIXct" "mean.POSIXlt" "rowMeans"
[13] "weighted.mean"
> ?'+'
Missing Data: NA
> z<-c(1,2,NA,8,3,NA,3)
>z
[1] 1 2 NA 8 3 NA 3
> is.na(z)
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE
NULL:
> z<-c(1,NULL,3)
>z
[1] 1 3
> d<-NULL
> is.null(d)
[1] TRUE

Data Frames:
A data frame is just like an Excel spreadsheet in that it has columns and rows. In statistical terms, each
column is a variable and each row is an observation.
> x<- 10:1
> y<--4:3
>x
[1] 10 9 8 7 6 5 4 3 2 1
>y
[1] -4 -3 -2 -1 0 1 2 3
> y<--4:5
>y
[1] -4 -3 -2 -1 0 1 2 3 4 5
> q<-
c("hockey","football","baseball","curling","rugby","lacrosse","basketball","tennis","cricket","soccer")
[1] "hockey" "football" "baseball" "curling" "rugby" "lacrosse"
[7] "basketball" "tennis" "cricket" "soccer"
> theDF<-data.frame(x,y,q)
> theDF
x y q
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
> theDF<-data.frame(First=x, Second=y,Third=q)
> theDF
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
> nrow(theDF)
[1] 10
> NCOL(theDF)
[1] 3
> dim.data.frame(theDF)
[1] 10 3
> dim(theDF)
[1] 10 3
> names(theDF)
[1] "First" "Second" "Third"
> names(theDF) [3]
[1] "Third"
> rownames(theDF)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
> rownames(theDF)<-c("one", "two","threee","four","five","six","seven","eight","nine","ten")
> row.names(theDF)
[1] "one" "two" "threee" "four" "five" "six" "seven" "eight"
[9] "nine" "ten"
> theDF
First Second Third
one 10 -4 hockey
two 9 -3 football
threee 8 -2 baseball
four 7 -1 curling
five 6 0 rugby
six 5 1 lacrosse
seven 4 2 basketball
eight 3 3 tennis
nine 2 4 cricket
ten 1 5 soccer
> rownames(theDF) <-NULL
> rownames(theDF)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
> head(theDF)
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
> head(theDF, n=7)
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
> tail(theDF)
First Second Third
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
> class(theDF)
[1] "data.frame"
> theDF
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
> theDF[3,2] #Third row, Second Column element
[1] -2
> theDF[3,2:3] # row 3, columns 2 through 3
Second Third
3 -2 baseball
> theDF[c(3,5),2] #rows 3 and 5, column 2
[1] -2 0
> theDF[c(3,5),2:3] # rows 3 and 5, column 2 through 3
Second Third
3 -2 baseball
5 0 rugby
> theDF$Third #only Third Column
[1] "hockey" "football" "baseball" "curling" "rugby" "lacrosse"
[7] "basketball" "tennis" "cricket" "soccer"
> theDF[,3]
[1] "hockey" "football" "baseball" "curling" "rugby" "lacrosse"
[7] "basketball" "tennis" "cricket" "soccer"
> theDF[,2:3] #column 2 through 3
Second Third
1 -4 hockey
2 -3 football
3 -2 baseball
4 -1 curling
5 0 rugby
6 1 lacrosse
7 2 basketball
8 3 tennis
9 4 cricket
10 5 soccer
> theDF[2,] #2 nd row
First Second Third
2 9 -3 football
> theDF[2:4,] # row 2 through 4
First Second Third
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
> theDF[,c("First","Third")] #access multiple column by name
First Third
1 10 hockey
2 9 football
3 8 baseball
4 7 curling
5 6 rugby
6 5 lacrosse
7 4 basketball
8 3 tennis
9 2 cricket
10 1 soccer

LISTS: A list can store any number of items of any type: numeric, character, or mixed.

> list(1,2,3) #creates a three element list


[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

> list(c(1,2,3)) # creates a single element list where the only element is a vector that has 3 elements
[[1]]
[1] 1 2 3

> list3<-list(c(1,2,3),3:7) # creates a two-element list.


> list3
[[1]]
[1] 1 2 3

[[2]]
[1] 3 4 5 6 7

#two element list , first element is a data.frame, second element is a 10 element vector
> list(theDF, 1:10)
[[1]]
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer

[[2]]
[1] 1 2 3 4 5 6 7 8 9 10

> list5<-list(theDF, 1:10, list3)


> list5
[[1]]
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer

[[2]]
[1] 1 2 3 4 5 6 7 8 9 10

[[3]]
[[3]][[1]]
[1] 1 2 3

[[3]][[2]]
[1] 3 4 5 6 7

2) Interact data through .csv files (Import from and export to .csv files).
In R, we can read data from files stored outside the R environment. We can also write data into files
which will be stored and accessed by the operating system. R can read and write various file formats
such as CSV, Excel, XML, etc.
Getting and Setting the Working Directory
You can check which directory the R workspace is pointing to using the getwd() function. You can also
set a new working directory using the setwd() function.
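For example, to point R at the folder that holds the .csv files used below (the path here simply matches the transcript; substitute your own folder):

> setwd("C:/Users/Prasanna Kumar/Documents") # make this folder the working directory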
> print(getwd())
[1] "C:/Users/Prasanna Kumar/Documents"
> data <- read.csv("2.csv") # Reading the Data from the .csv file
> print(data) # data
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
Analyzing the CSV File
> print(ncol(data)) # number of columns
[1] 5
> print(nrow(data)) # number of rows
[1] 8
Get the maximum salary
>sal <- max(data$salary)
>print(sal)
[1] 843.25
Get all the people working in IT department
> retval <- subset( data, dept == "IT")
> retval
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
6 6 Nina 578.0 2013-05-21 IT

> info <- subset(data, salary > 600 & dept == "IT") # IT dept with salary>600
> print(info)
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
> retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
> retval
id name salary start_date dept
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
8 8 Guru 722.50 2014-06-17 Finance
Writing into a CSV File
R can create a CSV file from an existing data frame. The write.csv() function is used to create the CSV file. This
file gets created in the working directory.
> write.csv(retval,"output.csv")
> newdata <- read.csv("output.csv")
> print(newdata)
X id name salary start_date dept
1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 5 Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
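The extra X column above holds the row names that write.csv() writes by default; it can be avoided by passing row.names = FALSE (a small optional variation, not part of the original transcript):

> write.csv(retval, "output.csv", row.names = FALSE) # omit the row-name column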

3. Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that
topic from swirl).

swirl is a software package for the R programming language that turns the R console into an interactive
learning environment. Users receive immediate feedback as they are guided through self-paced lessons
in data science and R programming.

The swirl R package makes it fun and easy to learn R programming and data science.
Step 1: Get R

In order to run swirl, you must have R 3.1.0 or later installed on your computer.
Step 2 (recommended): Get RStudio

In addition to R, it’s highly recommended that you install RStudio, which will make your experience with
R much more enjoyable.
Step 3: Install swirl

Open RStudio (or just plain R if you don't have RStudio) and type the following into the console:

> install.packages("swirl")

Note that the > symbol at the beginning of the line is R's prompt for you to type something into the console.
Step 4: Start swirl

This is the only step that you will repeat every time you want to run swirl. First, you will load the package
using the library() function. Then you will call the function that starts the magic! Type the following,
pressing Enter after each line:
> library("swirl")
> swirl()
Step 5: Install an interactive course

The first time you start swirl, you'll be prompted to install a course. You can either install one of the
recommended courses or visit course repository for more options. There are even more courses available
from the Swirl Course Network.

If you'd like to install a course that is not part of our course repository, type ?InstallCourses at the R
prompt for a list of functions that will help to do so.

Getting and Cleaning Data


Installation
swirl::install_course("Getting and Cleaning Data")
Manual Installation

1. Download getting_and_cleaning_data.swc file.


2. Run swirl::install_course() in the R console.
3. Select the file you just downloaded

> swirl::install_course("Getting and Cleaning Data")


|============================================================| 100%

| Course installed successfully!

> library(swirl)
> swirl()
What shall I call you? Prasanna Kumar
... <-- That's your cue to press Enter to continue
Select 1, 2, or 3 and press Enter

1: Continue.
2: Proceed.
3: Let's get going!

Selection: 1
...

| Please choose a course, or type 0 to exit swirl.

1: Getting and Cleaning Data


2: R Programming
3: Take me to the swirl course repository!

Selection: 1

| Please choose a lesson, or type 0 to return to course menu.

1: Manipulating Data with dplyr


2: Grouping and Chaining with dplyr
3: Tidying Data with tidyr
4: Dates and Times with lubridate

Selection: 1
| Attempting to load lesson dependencies...

| This lesson requires the ‘dplyr’ package. Would you like me to install it for you
| now?

1: Yes
2: No

Selection: 1
package ‘purrr’ successfully unpacked and MD5 sums checked
package ‘generics’ successfully unpacked and MD5 sums checked
package ‘tidyselect’ successfully unpacked and MD5 sums checked
package ‘dplyr’ successfully unpacked and MD5 sums checked

| Package ‘dplyr’ loaded correctly!


| I've created a variable called path2csv, which contains the full file path to the
| dataset. Call read.csv() with two arguments, path2csv and stringsAsFactors =
| FALSE, and save the result in a new variable called mydf. Check ?read.csv if you
| need help.

> read.csv(path2csv,stringsAsFactors =FALSE)


| Nice try, but that's not exactly what I was hoping for. Try again. Or, type
| info() for more options.

| Store the result of read.csv(path2csv, stringsAsFactors = FALSE) in a new


| variable called mydf.

> mydf<-read.csv(path2csv,stringsAsFactors =FALSE)


| dim(mydf) will give you the dimensions of the dataset.

> dim(mydf)
[1] 225468 11

| Great job!

|====== | 8%
| Now use head() to preview the data.

> head(mydf)
X date time size r_version r_arch r_os package version
1 1 2014-07-08 00:54:41 80589 3.1.0 x86_64 mingw32 htmltools 0.2.4
2 2 2014-07-08 00:59:53 321767 3.1.0 x86_64 mingw32 tseries 0.10-32
3 3 2014-07-08 00:47:13 748063 3.1.0 x86_64 linux-gnu party 1.0-15
4 4 2014-07-08 00:48:05 606104 3.1.0 x86_64 linux-gnu Hmisc 3.14-4
5 5 2014-07-08 00:46:50 79825 3.0.2 x86_64 linux-gnu digest 0.6.4
6 6 2014-07-08 00:48:04 77681 3.1.0 x86_64 linux-gnu randomForest 4.6-7
country ip_id
1 US 1
2 US 2
3 US 3
4 US 3
5 CA 4
6 US 3

| You are quite good my friend!

> library(dplyr)

| Your dedication is inspiring!

> packageVersion("dplyr")
[1] ‘1.0.7’

...

|=========== | 15%
| The first step of working with data in dplyr is to load the data into what the
| package authors call a 'data frame tbl' or 'tbl_df'. Use the following code to
| create a new tbl_df called cran:
|
| cran <- tbl_df(mydf).

> cran <- tbl_df(mydf)

| Nice work!

|============ | 17%
| To avoid confusion and keep things running smoothly, let's remove the original
| data frame from your workspace with rm("mydf").

> rm("mydf")

| You got it!


|============== | 18%
| From ?tbl_df, "The main advantage to using a tbl_df over a regular data frame is
| the printing." Let's see what is meant by this. Type cran to print our tbl_df to
| the console.

> cran
# A tibble: 225,468 x 11
X date time size r_version r_arch r_os package version country ip_id
<int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 1 2014-0~ 00:54~ 8.06e4 3.1.0 x86_64 ming~ htmltoo~ 0.2.4 US 1
2 2 2014-0~ 00:59~ 3.22e5 3.1.0 x86_64 ming~ tseries 0.10-32 US 2
3 3 2014-0~ 00:47~ 7.48e5 3.1.0 x86_64 linu~ party 1.0-15 US 3
4 4 2014-0~ 00:48~ 6.06e5 3.1.0 x86_64 linu~ Hmisc 3.14-4 US 3
5 5 2014-0~ 00:46~ 7.98e4 3.0.2 x86_64 linu~ digest 0.6.4 CA 4
6 6 2014-0~ 00:48~ 7.77e4 3.1.0 x86_64 linu~ randomF~ 4.6-7 US 3
7 7 2014-0~ 00:48~ 3.94e5 3.1.0 x86_64 linu~ plyr 1.8.1 US 3
8 8 2014-0~ 00:47~ 2.82e4 3.0.2 x86_64 linu~ whisker 0.3-2 US 5
9 9 2014-0~ 00:54~ 5.93e3 NA NA NA Rcpp 0.10.4 CN 6
10 10 2014-0~ 00:15~ 2.21e6 3.0.2 x86_64 linu~ hflights 0.1 US 7
# .... with 225,458 more rows

| Your dedication is inspiring!


|=============== | 20%
| This output is much more informative and compact than what we would get if we
| printed the original data frame (mydf) to the console.

...

|================ | 22%
| First, we are shown the class and dimensions of the dataset. Just below that, we
| get a preview of the data. Instead of attempting to print the entire dataset,
| dplyr just shows us the first 10 rows of data and only as many columns as fit
| neatly in our console. At the bottom, we see the names and classes for any
| variables that didn't fit on our screen.

...

|================= | 23%
| According to the "Introduction to dplyr" vignette written by the package authors,
| "The dplyr philosophy is to have small functions that each do one thing well."
| Specifically, dplyr supplies five 'verbs' that cover most fundamental data
| manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().

...

|================== | 25%
| Use ?select to pull up the documentation for the first of these core functions.

> ?select

| Keep working like that and you'll get there!

|==================== | 27%
| Help files for the other functions are accessible in the same way.

...

|===================== | 28%
| As may often be the case, particularly with larger datasets, we are only
| interested in some of the variables. Use select(cran, ip_id, package, country) to
| select only the ip_id, package, and country variables from the cran dataset.

> select(cran, ip_id, package, country)


# A tibble: 225,468 x 3
ip_id package country
<int> <chr> <chr>
1 1 htmltools US
2 2 tseries US
3 3 party US
4 3 Hmisc US
5 4 digest CA
6 3 randomForest US
7 3 plyr US
8 5 whisker US
9 6 Rcpp CN
10 7 hflights US
# ... with 225,458 more rows

| That's correct!

|====================== | 30%
| The first thing to notice is that we don't have to type cran$ip_id, cran$package,
| and cran$country, as we normally would when referring to columns of a data frame.
| The select() function knows we are referring to columns of the cran dataset.

...

|======================= | 32%
| Also, note that the columns are returned to us in the order we specified, even
| though ip_id is the rightmost column in the original dataset.

...

|========================= | 33%
| Recall that in R, the `:` operator provides a compact notation for creating a
| sequence of numbers. For example, try 5:20.

4) Visualize all Statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc., using
Histograms, Boxplots and Scatter Plots).

MEAN
The mean of an observation variable is a numerical measure of the central location of the data values. It is
the sum of its data values divided by data count.
Hence, for a data sample of size n, its sample mean is defined as:

mean = (x1 + x2 + ... + xn) / n

Problem
Find the mean eruption duration in the data set faithful.
>head(faithful)
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
Solution
We apply the mean function to compute the mean value of eruptions.
> duration = faithful$eruptions # the eruption durations
> mean(duration) # apply the mean function
[1] 3.4878
Answer
The mean eruption duration is 3.4878 minutes.

MEDIAN
The median of an observation variable is the value at the middle when the data is sorted in ascending order.
It is an ordinal measure of the central location of the data values.
Problem
Find the median of the eruption duration in the data set faithful.
Solution
We apply the median function to compute the median value of eruptions.
> duration = faithful$eruptions # the eruption durations
> median(duration) # apply the median function
[1] 4
Answer
The median of the eruption duration is 4 minutes.

MODE

It is the value that has the highest frequency in the given data set. The data set may have no mode if the
frequency of all data points is the same, and it may have more than one mode if two or more data points
share the highest frequency. There is no built-in function for finding the mode in R, so we can write our
own function for finding the mode, or we can use the package called ‘modeest’.
> mode <- function(v) {
+ tab <- table(v) # frequency of each distinct value
+ as.numeric(names(tab)[which.max(tab)]) # value with the highest frequency
+ }
> mode(faithful$eruptions)
[1] 1.867

QUARTILE
There are several quartiles of an observation variable. The first quartile, or lower quartile, is the value that
cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the
value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.
Problem
Find the quartiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the quartiles of eruptions.
> duration = faithful$eruptions # the eruption durations
> quantile(duration) # apply the quantile function
0% 25% 50% 75% 100%
1.6000 2.1627 4.0000 4.4543 5.1000
Answer
The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and 4.4543 minutes
respectively.

PERCENTILE
The nth percentile of an observation variable is the value that cuts off the first n percent of the data values
when it is sorted in ascending order.
Problem
Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles of eruptions with the desired percentage ratios.
> duration = faithful$eruptions # the eruption durations
> quantile(duration, c(.32, .57, .98))
32% 57% 98%
2.3952 4.1330 4.9330
Answer
The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330 minutes
respectively.

RANGE
The range of an observation variable is the difference of its largest and smallest data values. It is a measure
of how far apart the entire data spreads in value.

Problem
Find the range of the eruption duration in the data set faithful.
Solution

We apply the max and min function to compute the largest and smallest values of eruptions, then take the
difference.
> duration = faithful$eruptions # the eruption durations
> max(duration) - min(duration) # apply the max and min functions
[1] 3.5
Answer
The range of the eruption duration is 3.5 minutes.

INTERQUARTILE RANGE
The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a
measure of how far apart the middle portion of data spreads in value.

Problem
Find the interquartile range of eruption duration in the data set faithful.
Solution
We apply the IQR function to compute the interquartile range of eruptions.
> duration = faithful$eruptions # the eruption durations
> IQR(duration) # apply the IQR function
[1] 2.2915
Answer
The interquartile range of eruption duration is 2.2915 minutes.

BOX PLOT
The box plot of an observation variable is a graphical representation based on its quartiles, as well as its
smallest and largest values. It attempts to provide a visual shape of the data distribution.
Problem
Find the box plot of the eruption duration in the data set faithful.
Solution
We apply the boxplot function to produce the box plot of eruptions.
> duration = faithful$eruptions # the eruption durations
> boxplot(duration, horizontal=TRUE) # horizontal box plot
Answer
The box plot of the eruption duration is:

HISTOGRAM
A histogram consists of parallel vertical bars that graphically shows the frequency distribution of a
quantitative variable. The area of each bar is equal to the frequency of items found in each class.
Example
In the data set faithful, the histogram of the eruptions variable is a collection of parallel vertical bars
showing the number of eruptions classified according to their durations.
Problem
Find the histogram of the eruption durations in faithful.
Solution
We apply the hist function to produce the histogram of the eruptions variable.
> duration = faithful$eruptions
> hist(duration, right=FALSE) # apply the hist function; intervals closed on the left

Answer
The histogram of the eruption durations is:

SCATTER PLOT
A scatter plot pairs up values of two quantitative variables in a data set and display them as geometric points
inside a Cartesian diagram.
Example
In the data set faithful, we pair up the eruptions and waiting values in the same observation as (x,y)
coordinates. Then we plot the points in the Cartesian plane. Here is a preview of the eruption data value
pairs with the help of the cbind function.

> duration = faithful$eruptions # the eruption durations


> waiting = faithful$waiting # the waiting interval
> head(cbind(duration, waiting))
duration waiting
[1,] 3.600 79
[2,] 1.800 54
[3,] 3.333 74
[4,] 2.283 62
[5,] 4.533 85
[6,] 2.883 55
Problem
Find the scatter plot of the eruption durations and waiting intervals in faithful. Does it reveal any
relationship between the variables?
Solution
We apply the plot function to compute the scatter plot of eruptions and waiting.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> plot(duration, waiting, # plot the variables
+ xlab="Eruption duration", # x-axis label
+ ylab="Time waited") # y-axis label
Answer
The scatter plot of the eruption durations and waiting intervals is as follows. It reveals a positive linear
relationship between them.

Enhanced Solution

We can generate a linear regression model of the two variables with the lm function, and then draw a trend
line with abline.

> abline(lm(waiting ~ duration))

5) Create a data frame with the following structure.
EMP ID EMP NAME SALARY START DATE
1 Satish 5000 01-11-2013
2 Vani 7500 05-06-2011
3 Ramesh 10000 21-09-1999
4 Praveen 9500 13-09-2005
5 Pallavi 4500 23-10-2000

a. Extract two column names using column name.


b. Extract the first two rows and then all columns.
c. Extract 3rd and 5th row with 2nd and 4th column.

> emp_id<-1:5
> emp_name<-c("Satish","Vani","Ramesh","Praveen","Pallavi")
> Salary<-c(5000,7500,10000,9500,4500)
> d1<-as.Date("01-11-2013", format="%d-%m-%Y")
> d2<-as.Date("05-06-2011", format="%d-%m-%Y")
> d3<-as.Date("21-09-1999", format="%d-%m-%Y")
> d4<-as.Date("13-09-2005", format="%d-%m-%Y")
> d5<-as.Date("23-10-2000", format="%d-%m-%Y")
> Start_Date<-c(d1,d2,d3,d4,d5)
> theDF<-data.frame(emp_id,emp_name,Salary,Start_Date)
> theDF
emp_id emp_name Salary Start_Date
1 1 Satish 5000 2013-11-01
2 2 Vani 7500 2011-06-05
3 3 Ramesh 10000 1999-09-21
4 4 Praveen 9500 2005-09-13
5 5 Pallavi 4500 2000-10-23

a. Extract two column names using column name.


> names(theDF)
[1] "emp_id" "emp_name" "Salary" "Start_Date"
> names(theDF)[1:2]
[1] "emp_id" "emp_name"
b. Extract the first two rows and then all columns.
> theDF[1:2,]
emp_id emp_name Salary Start_Date
1 1 Satish 5000 2013-11-01
2 2 Vani 7500 2011-06-05

c. Extract 3rd and 5th row with 2nd and 4th column.
> theDF[c(3,5),c(2,4)]
emp_name Start_Date
3 Ramesh 1999-09-21
5 Pallavi 2000-10-23
6) Write R Program using ‘apply’ group of functions to create and apply normalization function on
each of the numeric variables/columns of iris dataset to transform them into
i. 0 to 1 range with min-max normalization.
ii. a value around 0 with z-score normalization.

#view first six rows of iris dataset


head(iris)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species


#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa

i. 0 to 1 range with min-max normalization.

Min-Max Normalization

The formula for a min-max normalization is:

(X – min(X))/(max(X) – min(X))

For each value of a variable, we simply find how far that value is from the minimum value, then divide
by the range.

To implement this in R, we can define a simple function and then use lapply to apply that function to
whichever columns in the iris dataset we would like:

#define Min-Max normalization function


min_max_norm <- function(x) {
(x - min(x)) / (max(x) - min(x))
}

#apply Min-Max normalization to first four columns in iris dataset


iris_norm <- as.data.frame(lapply(iris[1:4], min_max_norm))

#view first six rows of normalized iris dataset


head(iris_norm)

# Sepal.Length Sepal.Width Petal.Length Petal.Width


#1 0.22222222 0.6250000 0.06779661 0.04166667
#2 0.16666667 0.4166667 0.06779661 0.04166667
#3 0.11111111 0.5000000 0.05084746 0.04166667
#4 0.08333333 0.4583333 0.08474576 0.04166667
#5 0.19444444 0.6666667 0.06779661 0.04166667
#6 0.30555556 0.7916667 0.11864407 0.12500000

Notice that each of the columns now have values that range from 0 to 1. Also notice that the fifth column
“Species” was dropped from this data frame. We can easily add it back by using the following code:

#add back Species column


iris_norm$Species <- iris$Species

#view first six rows of iris_norm


head(iris_norm)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 0.22222222 0.6250000 0.06779661 0.04166667 setosa
#2 0.16666667 0.4166667 0.06779661 0.04166667 setosa
#3 0.11111111 0.5000000 0.05084746 0.04166667 setosa
#4 0.08333333 0.4583333 0.08474576 0.04166667 setosa
#5 0.19444444 0.6666667 0.06779661 0.04166667 setosa
#6 0.30555556 0.7916667 0.11864407 0.12500000 setosa

ii. a value around 0 with z-score normalization

A drawback of the min-max normalization technique is that it is sensitive to outliers: a single extreme
value compresses all of the other values into a narrow part of the 0 to 1 range. If the data contain
outliers, a z-score standardization is usually the better technique to implement.

The formula for a z-score standardization is:

(X – μ) / σ

For each value of a variable, we simply subtract the mean value of the variable, then divide by the
standard deviation of the variable.

To implement this in R, we have a few different options:

1. Standardize one variable

If we simply want to standardize one variable in a dataset, such as Sepal.Width in the iris dataset, we can
use the following code:

#standardize Sepal.Width
iris$Sepal.Width <- (iris$Sepal.Width - mean(iris$Sepal.Width)) / sd(iris$Sepal.Width)

head(iris)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species


#1 5.1 1.01560199 1.4 0.2 setosa
#2 4.9 -0.13153881 1.4 0.2 setosa
#3 4.7 0.32731751 1.3 0.2 setosa
#4 4.6 0.09788935 1.5 0.2 setosa
#5 5.0 1.24503015 1.4 0.2 setosa
#6 5.4 1.93331463 1.7 0.4 setosa

The values of Sepal.Width are now scaled such that the mean is 0 and the standard deviation is 1. We can
even verify this if we’d like:

#find mean of Sepal.Width


mean(iris$Sepal.Width)

#[1] 2.034094e-16 #basically zero

#find standard deviation of Sepal.Width


sd(iris$Sepal.Width)

#[1] 1

2. Standardize several variables using the scale function

To standardize several variables, we can simply use the scale function. For example, the following code
shows how to scale the first four columns of the iris dataset:

#standardize first four columns of iris dataset


iris_standardize <- as.data.frame(scale(iris[1:4]))

#view first six rows of standardized dataset


head(iris_standardize)

# Sepal.Length Sepal.Width Petal.Length Petal.Width


#1 -0.8976739 1.01560199 -1.335752 -1.311052
#2 -1.1392005 -0.13153881 -1.335752 -1.311052
#3 -1.3807271 0.32731751 -1.392399 -1.311052
#4 -1.5014904 0.09788935 -1.279104 -1.311052
#5 -1.0184372 1.24503015 -1.335752 -1.311052
#6 -0.5353840 1.93331463 -1.165809 -1.048667
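The experiment asks for the ‘apply’ group of functions; for the z-score case, a minimal sketch (not part of the original solution) using sapply is:

#z-score normalization of the four numeric columns with sapply
z_norm <- function(x) (x - mean(x)) / sd(x) # z-score of one column
iris_z <- as.data.frame(sapply(iris[1:4], z_norm)) # apply it to each numeric column
head(iris_z)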

7) Create a data frame with 10 observations and 3 variables and add new rows and columns to it
using ‘rbind’ and ‘cbind’ function.

>df1 = data.frame(name = c("Rahul","joe","Adam","Brendon","Srilakshmi","Prasanna Kumar",
"Anitha","Bhanu","Rajesh","Priya"),
married_year = c(2016,2015,2016,2008,2007,2009,2011,2013,2014,2008),
Salary = c(10000,15000,12000,13000,14000,15000,12000,10000,11000,14000))
> df1
name married_year Salary
1 Rahul 2016 10000
2 joe 2015 15000
3 Adam 2016 12000
4 Brendon 2008 13000
5 Srilakshmi 2007 14000
6 Prasanna Kumar 2009 15000
7 Anitha 2011 12000
8 Bhanu 2013 10000
9 Rajesh 2014 11000
10 Priya 2008 14000
>father_name<-c("Gandhi","Jashua","God","Bush","Venkateswarlu","David",
"Anand","Bharath","Rupesh","Prem Sagar")
> father_name
[1] "Gandhi" "Jashua" "God" "Bush" "Venkateswarlu" "David"
[7] "Anand" "Bharath" "Rupesh" "Prem Sagar"
> cb_df1= cbind(df1,father_name) # cbind to add new column
> cb_df1
name married_year Salary father_name
1 Rahul 2016 10000 Gandhi
2 joe 2015 15000 Jashua
3 Adam 2016 12000 God
4 Brendon 2008 13000 Bush
5 Srilakshmi 2007 14000 Venkateswarlu
6 Prasanna Kumar 2009 15000 David
7 Anitha 2011 12000 Anand
8 Bhanu 2013 10000 Bharath
9 Rajesh 2014 11000 Rupesh
10 Priya 2008 14000 Prem Sagar

> rb<-c("Prakash",2011,15000,"Jeevan")
> rb
[1] "Prakash" "2011" "15000" "Jeevan"
> rb_df1<-rbind(cb_df1,rb) # rbind to add new row
> rb_df1
name married_year Salary father_name
1 Rahul 2016 10000 Gandhi
2 joe 2015 15000 Jashua
3 Adam 2016 12000 God
4 Brendon 2008 13000 Bush
5 Srilakshmi 2007 14000 Venkateswarlu
6 Prasanna Kumar 2009 15000 David
7 Anitha 2011 12000 Anand
8 Bhanu 2013 10000 Bharath
9 Rajesh 2014 11000 Rupesh
10 Priya 2008 14000 Prem Sagar
11 Prakash 2011 15000 Jeevan
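Note that rb is a plain character vector, so rbind() coerces every column of rb_df1 to character. A sketch (not part of the original solution, and assuming a recent R version where data.frame() keeps strings as character) that preserves the column types is to bind a one-row data frame instead:

rb_row <- data.frame(name = "Prakash", married_year = 2011,
                     Salary = 15000, father_name = "Jeevan")
rb_df1 <- rbind(cb_df1, rb_row) # numeric columns stay numeric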

8) Write R program to implement linear and multiple regression on ‘mtcars’ dataset to estimate the
value of ‘mpg’ variable, with best R2 and plot the original values in ‘green’ and predicted values in
‘red’.

Name Description
mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (lb/1000)
qsec 1/4 mile time
vs V/S
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburettors

If we are interested in the relationship between fuel efficiency (mpg) and weight (wt) we may start
plotting those variables with:
>plot(mpg ~ wt, data = mtcars, col=2)

The plot shows a (linear) relationship. Then, to perform linear regression and determine the
coefficients of a linear model, we use the lm function:

fit <- lm(mpg ~ wt, data = mtcars)

The ~ here means "explained by", so the formula mpg ~ wt means we are predicting mpg as explained by
wt. The most helpful way to view the output is with:

>summary(fit)

Which gives the output:

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom


Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

This provides information about:

• the estimated slope of each coefficient (wt and the y-intercept), which suggests the best-fit
prediction of mpg is 37.2851 + (-5.3445) * wt
• the p-value of each coefficient, which suggests that the intercept and weight are probably not due
to chance
• overall estimates of fit such as R^2 and adjusted R^2, which show how much of the variation in
mpg is explained by the model

We could add a line to our first plot to show the predicted mpg:

abline(fit,col=3,lwd=2)

It is also possible to add the equation to that plot. First, get the coefficients with coef. Then, using paste0,
we collapse the coefficients with the appropriate variables and +/- signs to build the equation. Finally, we add it to
the plot using mtext:

bs <- round(coef(fit), 3)
lmlab <- paste0("mpg = ", bs[1],
ifelse(sign(bs[2])==1, " + ", " - "), abs(bs[2]), " wt ")
mtext(lmlab, 3, line=-2)

The result is:
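The experiment statement also asks to plot the original mpg values in ‘green’ and the predicted values in ‘red’; a minimal sketch (not part of the original transcript) that does this with the fitted object fit is:

plot(mtcars$wt, mtcars$mpg, col = "green", pch = 19,
     xlab = "wt", ylab = "mpg") # original values in green
points(mtcars$wt, predict(fit), col = "red", pch = 19) # predicted values in red
legend("topright", legend = c("original", "predicted"),
       col = c("green", "red"), pch = 19)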

Multiple Regression – Mathematical Formula

In multiple regression there are more than one predictor variable and one response variable, relation of the
variables is shown below:

Y = a + b1x1 + b2x2 + … + bnxn, where Y is the response, a, b1, …, bn are the coefficients and x1, …, xn are the
predictor variables.

Multiple Regression using R Programming

For this exercise on Multiple Regression Analysis using R programming, we use the mtcars dataset
and see how the model is built for two and three predictor variables.

Case Study 1: Establishing Relationship between “mpg” as response variable and “disp”, “hp” as predictor
variables.

Step1: Load the required data

>data <- mtcars[,c("mpg","disp","hp")]


##Create a new data object with all rows and only the required columns
>head(data)

Step2: Build Model using lm() function

>model <- lm(mpg~disp+hp, data=data)


>summary(model)

As you can see in the summary output shown above, we have got the intercept value which is the value of
‘a’ in the equation and coefficients of “disp” and “hp” are -0.030346 and -0.024840 respectively. Therefore
the regression analysis equation will be:

mpg = 30.735904 + (-0.030346)disp + (-0.024840)hp

Using the above equation we can predict the value of mpg based on disp and hp.

Step3: Predicting the output.


>predict(model, newdata = data.frame(disp=140, hp=80))
Predicted Output Mileage is 24.50022
If you enter the values of disp and hp in the equation derived above you will get the same output.
Plotting the Regression:

>plot(model)

Output:

Case Study 2: Establishing Relationship between “mpg” as response variable and “disp”, “hp” and “wt”
as predictor variables.

>model1 <- lm(mpg~disp+hp+wt, data=mtcars)


>summary(model1)

Equation will be like:


mpg = 37.105505 + (-0.000937)disp + (-0.031157)hp + (-3.800891)wt

>predict(model1, newdata = data.frame(disp=160, hp=100, wt=2.5))

Predicted Output Mileage is 24.3377
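Since the experiment asks for the model with the best R2, the three fits can be compared directly (a sketch, assuming fit, model and model1 from the steps above are still in the workspace):

summary(fit)$r.squared # mpg ~ wt
summary(model)$r.squared # mpg ~ disp + hp
summary(model1)$r.squared # mpg ~ disp + hp + wt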

9) Implement k-means clustering using R.

K-means is a clustering algorithm that repeatedly assigns each data point to one of k groups
according to the features of the point. It is a centroid-based clustering method.

Step 1

The iris dataset has 5 columns, namely Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species. Iris
is a flower, and this dataset contains 3 of its species: Setosa, Versicolor and Virginica. We will
cluster the flowers according to their species. The code to load the dataset:

>data("iris")
>head(iris) #will show top 6 rows only

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Step 2

The next step is to separate the 3rd and 4th columns into separate object x as we are using the unsupervised
learning method. We are removing labels so that the huge input of petal length and petal width columns
will be used by the machine to perform clustering unsupervised.

>x=iris[,3:4] #using only petal length and width columns


>head(x)
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4

Step 3

The next step is to use the K Means algorithm. K Means is the method we use which has parameters (data,
no. of clusters or groups). Here our data is the x object and we will have k=3 clusters as there are 3 species
in the dataset.

Then the ‘cluster’ package is called. Clustering in R is done using this inbuilt package which will perform
all the mathematics. Clusplot function creates a 2D graph of the clusters.

>model=kmeans(x,3)
>library(cluster)
>clusplot(x,model$cluster)

Component 1 and Component 2 seen in the graph are the two components in PCA (Principal Component
Analysis) which is basically a feature extraction method that uses the important components and removes
the rest. It reduces the dimensionality of the data for easier KMeans application. All of this is done by the
cluster package itself in R.
Step 4

The next step is to assign different colors to the clusters and to shade them; hence we use the color and shade
parameters, setting them to T (TRUE).

>clusplot(x,model$cluster,color=T,shade=T)
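To check how well the clusters line up with the actual species, a quick cross-tabulation can be added (a sketch, not part of the original transcript; the cluster numbering varies from run to run):

>table(model$cluster, iris$Species) # rows are cluster labels, columns are species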

10) Implement k-medoids clustering using R.
K-medoids clustering is a technique in which we place each observation in a dataset into one of K clusters.
The end goal is to have K clusters in which the observations within each cluster are quite similar to each
other while the observations in different clusters are quite different from each other.
In practice, we use the following steps to perform K-medoids clustering:
1. Choose a value for K.
 First, we must decide how many clusters we’d like to identify in the data. Often we have to simply
test several different values for K and analyze the results to see which number of clusters seems to
make the most sense for a given problem.
2. Randomly assign each observation to an initial cluster, from 1 to K.
3. Perform the following procedure until the cluster assignments stop changing.
 For each of the K clusters, compute the cluster centre. In K-medoids this centre (the medoid) is the most
centrally located observation in the kth cluster.
 Assign each observation to the cluster whose medoid is closest. Here, closest is defined using
Euclidean distance (a short sketch of this distance computation follows this list).
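As a minimal illustration of the Euclidean-distance step above (the values below are hypothetical and are not part of the pam() workflow that follows):

>obs <- c(1.2, 0.8)              # a hypothetical scaled observation
>centre1 <- c(1.0, 1.0)          # hypothetical centre of cluster 1
>centre2 <- c(-0.5, -0.3)        # hypothetical centre of cluster 2
>sqrt(sum((obs - centre1)^2))    # distance to cluster 1
>sqrt(sum((obs - centre2)^2))    # distance to cluster 2; assign to the nearer centre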
K-Medoids Clustering in R
Step 1: Load the Necessary Packages
First, we’ll load two packages that contain several useful functions for k-medoids clustering in R.
>library(factoextra)
>library(cluster)

Step 2: Load and Prep the Data


For this example we’ll use the USArrests dataset built into R, which contains the number of arrests per
100,000 residents in each U.S. state in 1973 for Murder, Assault, and Rape along with the percentage of
the population in each state living in urban areas, UrbanPop.
The following code shows how to do the following:
 Load the USArrests dataset
 Remove any rows with missing values
 Scale each variable in the dataset to have a mean of 0 and a standard deviation of 1
#load data
df <- USArrests

#remove rows with missing values


df <- na.omit(df)

#scale each variable to have a mean of 0 and sd of 1


df <- scale(df)

#view first six rows of dataset


head(df)

Murder Assault UrbanPop Rape


Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
Arizona 0.07163341 1.4788032 0.9989801 1.042878388
Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144 1.7589234 2.067820292
Colorado 0.02571456 0.3988593 0.8608085 1.864967207

Step 3: Find the Optimal Number of Clusters


To perform k-medoids clustering in R we can use the pam() function, which stands for “partitioning around
medoids” and uses the following syntax (a small illustrative call is shown after the argument list below):
pam(data, k, metric = “euclidean”, stand = FALSE)
where:
 data: Name of the dataset.
 k: The number of clusters.
 metric: The metric to use to calculate distance. Default is euclidean but you could also specify
manhattan.
 stand: Whether or not to standardize each variable in the dataset. Default is FALSE.
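Purely to illustrate the arguments listed above, a minimal call might look like the following sketch; the actual analysis with the optimal k is performed in Step 4.

>pam(df, k = 2, metric = "manhattan", stand = FALSE)   # illustrative only: 2 clusters, manhattan distance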
Since we don’t know beforehand how many clusters is optimal, we’ll create two different plots that can
help us decide:
1. Number of Clusters vs. the Total Within Sum of Squares
First, we’ll use the fviz_nbclust() function to create a plot of the number of clusters vs. the total within sum
of squares:
fviz_nbclust(df, pam, method = "wss")

The total within sum of squares will typically decrease as we increase the number of clusters, so when we
create this type of plot we look for an “elbow” where the sum of squares begins to “bend” or level
off.
The point where the plot bends is typically the optimal number of clusters. Beyond this number, overfitting
is likely to occur.
For this plot it appears that there is a bit of an elbow or “bend” at k = 4 clusters.
2. Number of Clusters vs. Gap Statistic
Another way to determine the optimal number of clusters is to use a metric known as the gap statistic,
which compares the total intra-cluster variation for different values of k with their expected values for a
distribution with no clustering.
We can calculate the gap statistic for each number of clusters using the clusGap() function from the cluster
package along with a plot of clusters vs. gap statistic using the fviz_gap_stat() function:
#calculate gap statistic based on number of clusters
gap_stat <- clusGap(df,
                    FUN = pam,
                    K.max = 10, #max clusters to consider
                    B = 50)     #total bootstrapped iterations

#plot number of clusters vs. gap statistic


fviz_gap_stat(gap_stat)

From the plot we can see that gap statistic is highest at k = 4 clusters, which matches the elbow method we
used earlier.
Step 4: Perform K-Medoids Clustering with Optimal K
Lastly, we can perform k-medoids clustering on the dataset using the optimal value for k of 4:
#make this example reproducible
set.seed(1)

#perform k-medoids clustering with k = 4 clusters


kmed <- pam(df, k = 4)

#view results
kmed

Medoids:
ID Murder Assault UrbanPop Rape


Alabama 1 1.2425641 0.7828393 -0.5209066 -0.003416473
Michigan 22 0.9900104 1.0108275 0.5844655 1.480613993
Oklahoma 36 -0.2727580 -0.2371077 0.1699510 -0.131534211
New Hampshire 29 -1.3059321 -1.3650491 -0.6590781 -1.252564419
Clustering vector:
Alabama Alaska Arizona Arkansas California
1 2 2 1 2
Colorado Connecticut Delaware Florida Georgia
2 3 3 2 1
Hawaii Idaho Illinois Indiana Iowa
3 4 2 3 4
Kansas Kentucky Louisiana Maine Maryland
3 3 1 4 2
Massachusetts Michigan Minnesota Mississippi Missouri
3 2 4 1 3
Montana Nebraska Nevada New Hampshire New Jersey
3 3 2 4 3
New Mexico New York North Carolina North Dakota Ohio
2 2 1 4 3
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
3 3 3 3 1
South Dakota Tennessee Texas Utah Vermont
4 1 2 3 4
Virginia Washington West Virginia Wisconsin Wyoming
3 3 4 4 3
Objective function:
build swap
1.035116 1.027102

Available components:
[1] "medoids" "id.med" "clustering" "objective" "isolation"
[6] "clusinfo" "silinfo" "diss" "call" "data"
Note that the four cluster medoids are actual observations in the dataset. Near the top of the output we can
see that the four medoids are the following states (this can be confirmed directly, as sketched after the list):
 Alabama
 Michigan
 Oklahoma
 New Hampshire
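A quick optional check (not in the original listing) is to print the medoids stored in the fitted object:

>kmed$medoids   # the rows of the scaled data chosen as cluster medoids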

We can visualize the clusters on a scatterplot that displays the first two principal components on the axes
using the fviz_cluster() function:
#plot results of final k-medoids model
fviz_cluster(kmed, data = df)

We can also append the cluster assignments of each state back to the original dataset:
#add cluster assignment to original data
final_data <- cbind(USArrests, cluster = kmed$cluster)
#view final data
head(final_data)

Murder Assault UrbanPop Rape cluster


Alabama 13.2 236 58 21.2 1
Alaska 10.0 263 48 44.5 2
Arizona 8.1 294 80 31.0 2
Arkansas 8.8 190 50 19.5 1
California 9.0 276 91 40.6 2
Colorado 7.9 204 78 38.7 2
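With the cluster labels attached, a short optional summary of each cluster on the original (unscaled) values can be computed, for example:

>aggregate(final_data[, 1:4], by = list(cluster = final_data$cluster), FUN = mean)   # mean of each variable per cluster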

11) Implement density-based clustering on the iris dataset.
The iris dataset is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in
his 1936 paper “The use of multiple measurements in taxonomic problems”. It consists of 50 samples from each of
3 species of Iris (Iris setosa, Iris virginica, Iris versicolor). Four features were measured from each sample,
the length and the width of the sepals and petals, and based on the combination of these four features Fisher
developed a linear discriminant model to distinguish the species from each other.
# Loading data
>data(iris)
# Structure
>str(iris)

Performing DBScan on Dataset

We apply the DBSCAN clustering algorithm to the iris dataset, which contains 150 observations; the species
label is removed so that only the four numeric attributes are used for clustering.

# Installing Packages
>install.packages("fpc")
# Loading package
>library(fpc)
# Remove label form dataset
>iris_1 <- iris[-5]
# Fitting DBScan clustering Model
# to training dataset
>set.seed(220) # Setting seed
>Dbscan_cl <- dbscan(iris_1, eps = 0.45, MinPts = 5)
>Dbscan_cl

# Checking cluster
>Dbscan_cl$cluster

# Table
>table(Dbscan_cl$cluster, iris$Species)
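Choosing eps is usually guided by a k-nearest-neighbour distance plot. If the separate ‘dbscan’ package is also installed (an assumption; it is not used elsewhere in this listing), a minimal sketch is:

# Choosing eps (optional; requires the separate 'dbscan' package)
>library(dbscan)              # note: this package also provides dbscan(); call fpc::dbscan() explicitly if both are loaded
>kNNdistplot(iris_1, k = 5)   # look for the "knee"; its height suggests a value for eps
>abline(h = 0.45, lty = 2)    # the eps value used above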

# Plotting Cluster
>plot(Dbscan_cl, iris_1, main = "DBScan")

>plot(Dbscan_cl, iris_1, main = "Petal Width vs Sepal Length")

12) Implement decision trees using the ‘readingSkills’ dataset.

Step 1: Run the required libraries


library(datasets)
library(caTools)
install.packages("party")
library(party)
library(dplyr)
library(magrittr)
Step 2: Load the dataset readingSkills and execute head(readingSkills)
>data("readingSkills")
> head(readingSkills)

Format
A data frame with 200 observations on the following 4 variables.
nativeSpeaker: a factor with levels no and yes, where yes indicates that the child is a native
speaker of the language of the reading test.
age : age of the child in years.
shoeSize: shoe size of the child in cm.
score: raw score on the reading test.

Step 3: Splitting dataset into 4:1 ratio for train and test data
>sample_data = sample.split(readingSkills, SplitRatio = 0.8)
## sample_data<- sample(2,nrow(readingSkills), replace=TRUE, prob=c(0.8,0.2))
>train_data <- subset(readingSkills, sample_data == TRUE)
>test_data <- subset(readingSkills, sample_data == FALSE)

Step 4: Create the decision tree model using ctree and plot the model
>model<- ctree(nativeSpeaker ~ ., train_data)
>plot(model)
The basic syntax for creating a decision tree in R is:
>ctree(formula, data)
where, formula describes the predictor and response variables and data is the data set used. In this case
nativeSpeaker is the response variable and the other predictor variables are represented by ., hence when
we plot the model we get the following output.
Output:

Step 5: Making a prediction
# testing the people who are native speakers and those who are not
>predict_model <- predict(model, test_data)
# creates a table to count how many are classified as native speakers and how many are not
>m_at <- table(test_data$nativeSpeaker, predict_model)
>m_at
The table (confusion matrix) counts, for each true class, how many of the test observations were predicted
to be native and non-native speakers. The diagonal entries are the correct predictions and the off-diagonal
entries are the misclassifications.
Step 6: Determining the accuracy of the model developed
>ac_Test <- sum(diag(m_at)) / sum(m_at)
>print(paste('Accuracy for test is found to be', ac_Test))
Here the test accuracy is calculated from the confusion matrix; in this run it is found to be about 0.74,
i.e. the model predicts with an accuracy of roughly 74%.
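Equivalently, as a small optional check, the accuracy can be computed directly from the predictions without building the table first:

>mean(predict_model == test_data$nativeSpeaker)   # proportion of test rows predicted correctly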

13) Implement decision trees using ‘iris’ dataset using package party and ‘rpart’. (Recursive
Partitioning and Regression Trees)
Fit a rpart model
Usage:
rpart(formula, data, weights, subset, na.action = na.rpart, method,
model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
Arguments
formula: a formula, with a response but no interaction terms. If this is a data frame, it is taken as the
model frame.
data : an optional data frame in which to interpret the variables named in the formula.
weights: optional case weights.
subset : optional expression saying that only a subset of the rows of the data should be used in the fit.
na.action: the default action deletes all observations for which y is missing, but keeps those in which one
or more predictors are missing.
method: one of "anova", "poisson", "class" or "exp". If method is missing then the routine tries to make
an intelligent guess. If y is a survival object, then method = "exp" is assumed, if y has 2 columns then
method = "poisson" is assumed, if y is a factor then method = "class" is assumed, otherwise method =
"anova" is assumed. It is wisest to specify the method directly, especially as more criteria may added to
the function in future.
Alternatively, method can be a list of functions named init, split and eval. Examples are given in the file
‘tests/usersplits.R’ in the sources, and in the vignettes ‘User Written Split Functions’.
model : if logical: keep a copy of the model frame in the result? If the input value for model is a model
frame (likely from an earlier call to the rpart function), then this frame is used rather than constructing
new data.
x : keep a copy of the x matrix in the result.
y : keep a copy of the dependent variable in the result. If missing and model is supplied this defaults
to FALSE.
parms : optional parameters for the splitting function.
Anova splitting has no parameters.
Poisson splitting has a single parameter, the coefficient of variation of the prior distribution on the rates.
The default value is 1.
Exponential splitting has the same parameter as Poisson.
For classification splitting, the list can contain any of: the vector of prior probabilities (component prior),
the loss matrix (component loss) or the splitting index (component split). The priors must be positive and
sum to 1. The loss matrix must have zeros on the diagonal and positive off-diagonal elements. The
splitting index can be gini or information. The default priors are proportional to the data counts, the losses
default to 1, and the split defaults to gini.
control : a list of options that control details of the rpart algorithm. See rpart.control.
cost : a vector of non-negative costs, one for each variable in the model. Defaults to one for all
variables. These are scalings to be applied when considering splits, so the improvement on splitting on a
variable is divided by its cost in deciding which split to choose.
...
arguments to rpart.control may also be specified in the call to rpart. They are checked against the list of
valid arguments.
Details: This differs from the tree function in S mainly in its handling of surrogate variables. In most
details it follows Breiman et al. (1984) quite closely. The R package tree provides a re-implementation of tree.
Value: An object of class rpart.
Program:
> library(rpart)
> install.packages('rpart.plot')
> library(rpart.plot)
>data<-iris
>head(data)

> dt3 = rpart(Species ~., control = rpart.control( minsplit = 10, maxdepth = 5),data=iris , method =
"poisson")
> dt3
n= 150

node), split, n, deviance, yval


* denotes terminal node

1) root 150 52.324810000 2.000000


2) Petal.Length< 2.45 50 0.004869366 1.009901 *
3) Petal.Length>=2.45 100 10.068000000 2.497512
6) Petal.Width< 1.75 54 1.935982000 2.091743
12) Petal.Length< 4.95 48 0.422411100 2.020619 *
13) Petal.Length>=4.95 6 0.531330400 2.615385 *
7) Petal.Width>=1.75 46 0.372588600 2.967742 *
> dt3 = rpart(Species ~., control = rpart.control( minsplit = 10, maxdepth = 5),data=iris , method =
"class")

> dt3
n= 150

node), split, n, loss, yval, (yprob)


* denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)


2) Petal.Length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
6) Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259)
12) Petal.Length< 4.95 48 1 versicolor (0.00000000 0.97916667 0.02083333) *
13) Petal.Length>=4.95 6 2 virginica (0.00000000 0.33333333 0.66666667) *
7) Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
>rpart.plot(dt3)

>rpart.plot(dt3,type=4,extra=1)
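As an optional follow-up (not part of the original listing), the classification tree dt3 can be used to predict the species and the predictions compared with the true labels:

>pred <- predict(dt3, newdata = iris, type = "class")   # predicted species for every row
>table(Predicted = pred, Actual = iris$Species)         # confusion matrix
>mean(pred == iris$Species)                             # resubstitution accuracy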

14. Use the Corpus() function to create a data corpus, then build a term matrix and reveal word
frequencies
corpus: Text Corpus Analysis
Text corpus data analysis, with full support for international text (Unicode). Functions for reading data
from newline-delimited 'JSON' files, for normalizing and tokenizing text, for searching for term
occurrences, and for computing term occurrence frequencies, including n-grams.
> install.packages("corpus")
package ‘corpus’ successfully unpacked and MD5 sums checked
> library(corpus)
> help(corpus)
The corpus package: text corpus analysis functions.
Details:
This package contains functions for text corpus analysis. To create a text object, use the read_ndjson or
as_corpus_text function. To split text into sentences or token blocks, use text_split. To specify
preprocessing behavior for transforming a text into a token sequence, use text_filter. To tokenize text or
compute term frequencies, use text_tokens, term_stats or term_matrix. To search for or count specific
terms, use text_locate, text_count, or text_detect.

term_matrix {corpus}
Description: Tokenize a set of texts and compute a term frequency matrix.
Usage:
term_matrix(x, filter = NULL, ngrams = NULL, select = NULL, group = NULL, transpose = FALSE, ...)
term_counts(x, filter = NULL, ngrams = NULL, select = NULL, group = NULL, ...)
Arguments
x : a text vector to tokenize.
filter : if non-NULL, a text filter to use instead of the default text filter for x.
ngrams: an integer vector of n-gram lengths to include, or NULL to use the select argument to determine
the n-gram lengths.
select :a character vector of terms to count, or NULL to count all terms that appear in x.
group : if non-NULL, a factor, character string, or integer vector of the same length as x specifying the
grouping behavior.
transpose: a logical value indicating whether to transpose the result, putting terms as rows instead of
columns.
... : additional properties to set on the text filter.
Details:
term_matrix tokenizes a set of texts and computes the occurrence counts for each term, returning the result
as a sparse matrix (texts-by-terms). term_counts returns the same information, but in a data frame.
If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the
ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set
are included. If both ngrams and select are NULL, then only unigrams (single type terms) are included.
If group is NULL, then the output has one set of term counts for each input text. Otherwise, we convert
group to a factor and compute one set of term counts for each level. Texts with NA values for group get
skipped.
Value:
term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for
each term and one row for each input text or (if group is non-NULL) for each grouping level. If filter$select
is non-NULL, then the column names will be equal to filter$select. Otherwise, the columns are assigned in
arbitrary order.
term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.
term_counts with group = NULL returns a data frame with one row for each entry of the term matrix, and
columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a factor with
levels equal to the selected terms. The "text" column is a factor with levels equal to
names(as_corpus_text(x)); calling as.integer on the "text" column converts from the factor values to the
integer row index in the term matrix.
term_counts with group non-NULL behaves similarly, but the result instead has columns named "group",
"term", and "count", with "group" giving the grouping level, as a factor.

Examples
text <- c("A rose is a rose is a rose.",
"A Rose is red, a violet is blue!",
"A rose by any other name would smell as sweet.")
term_matrix(text)

# select certain terms


term_matrix(text, select = c("rose", "red", "violet", "sweet"))

# specify a grouping factor


term_matrix(text, group = c("Good", "Bad", "Good"))

# include higher-order n-grams


term_matrix(text, ngrams = 1:3)

# select certain multi-type terms


term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))

# transpose the result


term_matrix(text, ngrams = 1:2, transpose = TRUE)[1:10, ] # first 10 rows

# data frame
head(term_counts(text), n = 10) # first 10 rows

# with grouping
term_counts(text, group = c("Good", "Bad", "Good"))

# taking names from the input


term_counts(c(a = "One sentence.", b = "Another", c = "!!"))
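The experiment title also mentions the Corpus() function; that function comes from the ‘tm’ package rather than from ‘corpus’. A minimal sketch of this alternative route, assuming ‘tm’ is installed and reusing the text vector defined above, is:

>install.packages("tm")                                    # if not already installed
>library(tm)
>docs <- Corpus(VectorSource(text))                        # create the data corpus
>docs <- tm_map(docs, content_transformer(tolower))        # basic cleaning
>docs <- tm_map(docs, removePunctuation)
>tdm <- TermDocumentMatrix(docs)                           # build the term matrix
>freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # term frequencies across the corpus
>head(freq)                                                # reveal the most frequent words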

VIVA QUESTIONS

1. What is R Programming?
2. What are the different data objects in R?
3. What makes a valid variable name in R?
4. What is the main difference between an Array and a matrix?
5. Which data object in R is used to store and process categorical data?
6. How can you load and use csv file in R?
7. How do you get the name of the current working directory in R?
8. What is R Base package?
9. How R is used in logistic regression?
10. How do you access the element in the 2nd column and 4th row of a matrix named M?
11. What is recycling of elements in a vector? Give an example.
12. What are different ways to call a function in R?
13. What is lazy function evaluation in R?
14. How do you install a package in R?
15. Name a R packages which is used to read XML files.
16. Can we update and delete any of the elements in a list?
17. Give the general expression to create a matrix in R.
18. which function is used to create a boxplot graph in R?
19. In doing time series analysis, what does frequency = 6 means in the ts() function?
20. What is reshaping of data in R?
21. What is the output of runif(4)?
22. How to get a list of all the packages installed in R ?
23. What is expected from running the command - strsplit(x,"e")?
24. Give a R script to extract all the unique words in uppercase from the string - "The quick brown
fox jumps over the lazy dog".
25. Vector v is c(1,2,3,4) and list x is list(5:8), what is the output of v*x[1]?
26. Vector v is c(1,2,3,4) and list x is list(5:8), what is the output of v*x[[1]]?
27. What does unlist() do?
28. Give the R expression to get 26 or less heads from a 51 tosses of a coin using pbinom.
29. X is the vector c(5,9.2,3,8.51,NA), What is the output of mean(x)?
30. How do you convert the data in a JSON file to a data frame?
31. Give a function in R that replaces all missing values of a vector x with the sum of elements of that
vector?
32. What is the use of apply() in R?
33. Is an array a matrix or a matrix an array?
34. How to find the help page on missing values?
35. How do you get the standard deviation for a vector x?
36. How do you set the path for current working directory in R?
37. What is the difference between "%%" and "%/%"?
38. What does col.max(x) do?
39. Give the command to create a histogram.
40. How do you remove a vector from the R workspace?
41. List the data sets available in package "MASS"
42. List the data sets available in all available packages.
43. What is the use of the command - install.packages(file.choose(), repos=NULL)?
44. Give the command to check if the element 15 is present in vector x.
45. Give the syntax for creating scatterplot matrices.
46. What is the difference between subset() function and sample() function in R?
47. How do you check if "m" is a matrix data object in R?
48. What is the output for the below expression all(NA==NA)?
49. How to obtain the transpose of a matrix in R?
50. What is the use of "next" statement in R?

What is a data warehouse?


A data warehouse is an electronic store of an organization’s historical data for the purpose of
reporting, analysis and data mining or knowledge discovery.
What are the benefits of a data warehouse?
A data warehouse helps to integrate data and store it historically so that we can analyze different
aspects of the business, including performance analysis, trends and prediction over a given time frame,
and use the results of the analysis to improve the efficiency of business processes.
What is the difference between OLTP and OLAP?
OLTP is the transaction system that collects business data. Whereas OLAP is the reporting and
analysis system on that data.
OLTP systems are optimized for INSERT, UPDATE operations and therefore highly normalized. On
the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT
operations.
What is data mart?
Data marts are generally designed for a single subject area. An organization may have data pertaining
to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each
department may have separate data marts. These data marts can be built on top of the data warehouse.
What is dimension?
A dimension is something that qualifies a quantity (measure).
For example, consider this: if I just say “20kg”, it does not mean anything. But if I say, “20kg of
Rice (product) is sold to Ramesh (customer) on 5th April (date)”, then that gives a meaningful sense.
Product, customer and date are dimensions that qualify the measure, 20kg.
Dimensions are mutually independent. Technically speaking, a dimension is a data element that
categorizes each item in a data set into non-overlapping regions.
What is Fact?
A fact is something that is quantifiable (Or measurable). Facts are typically (but not always)
numerical values that can be aggregated.
Briefly state the difference between a data warehouse and a data mart.
A data warehouse is made up of many data marts and contains many subject areas, whereas a data mart
generally focuses on one subject area. For example, if there is a data warehouse for a bank, there can be
one data mart for accounts, one for loans, etc. Metadata is data about data; for example, if a data mart
receives a file, the metadata will contain information such as how many columns it has, whether it is
fixed-width or delimited, the ordering of the fields, the datatypes of the fields, etc.
What is the difference between dependent data warehouse and independent data warehouse?
A dependent data mart is built on top of a central data warehouse, while an independent data mart is
sourced directly from operational systems or external files. There is also a third, hybrid type of data
mart, which takes its source data both from operational systems or external files and from the central
data warehouse.
What are the storage models of OLAP?
ROLAP, MOLAP and HOLAP
What are CUBES?
A data cube stores data in a summarized version which helps in a faster analysis of data. The data is
stored in such a way that it allows reporting easily.
E.g. using a data cube A user may want to analyze weekly, monthly performance of an employee.
Here, month and week could be considered as the dimensions of the cube.
What is MODEL in Data mining world?
Models in Data mining help the different algorithms in decision making or pattern matching. The
second stage of data mining involves considering various models and choosing the best one based on
their predictive performance.
Explain how to mine an OLAP cube.
A data mining extension can be used to slice the data the source cube in the order as discovered by
data mining. When a cube is mined the case table is a dimension.
Explain how to use DMX-the data mining query language.
Data mining extension is based on the syntax of SQL. It is based on relational concepts and mainly
used to create and manage the data mining models. DMX comprises of two types of statements: Data
definition and Data manipulation. Data definition is used to define or create new models, structures.
Define Rollup and cube.
Custom rollup operators provide a simple way of controlling the process of rolling up a member to its
parent’s values. The rollup uses the contents of the column as a custom rollup operator for each member
and is used to evaluate the value of the member’s parents.
If a cube has multiple custom rollup formulas and custom rollup members, then the formulas are
resolved in the order in which the dimensions have been added to the cube.
Differentiate between Data Mining and Data warehousing.
Data warehousing is merely extracting data from different sources, cleaning the data and storing it in
the warehouse. Where as data mining aims to examine or explore the data using queries. These
queries can be fired on the data warehouse. Explore the data in data mining helps in reporting,
planning strategies, finding meaningful patterns etc.
E.g. a data warehouse of a company stores all the relevant information of projects and employees.
Using Data mining, one can use this data to generate different reports like profits generated etc.
What is Discrete and Continuous data in Data mining world?
Discrete data can be considered as defined or finite data, e.g. mobile numbers, gender. Continuous
data can be considered as data which changes continuously and in an ordered fashion, e.g. age.
What is a Decision Tree Algorithm?
A decision tree is a tree in which every node is either a leaf node or a decision node. This tree takes an
input an object and outputs some decision. All Paths from root node to the leaf node are reached by
either using AND or OR or BOTH. The tree is constructed using the regularities of the data. The
decision tree is not affected by Automatic Data Preparation.
What is Naïve Bayes Algorithm?
Naïve Bayes Algorithm is used to generate mining models. These models help to identify
relationships between input columns and the predictable columns. This algorithm can be used in the
initial stage of exploration. The algorithm calculates the probability of every state of each input
column given predictable columns possible states. After the model is made, the results can be used for
exploration and making predictions.
Explain clustering algorithm.
Clustering algorithm is used to group sets of data with similar characteristics also called as clusters.
These clusters help in making faster decisions, and exploring data. The algorithm first identifies
relationships in a dataset following which it generates a series of clusters based on the relationships.
The process of creating clusters is iterative. The algorithm redefines the groupings to create clusters
that better represent the data.
Explain Association algorithm in Data mining?
Association algorithm is used for recommendation engine that is based on a market based analysis.
This engine suggests products to customers based on what they bought earlier. The model is built on a
dataset containing identifiers. These identifiers are both for individual cases and for the items that
cases contain. These groups of items in a data set are called an item set. The algorithm traverses the
data set to find items that appear together in a case; the MINIMUM_SUPPORT parameter controls which
associated item sets are frequent enough to be kept.
What are the goals of data mining?
Prediction, identification, classification and optimization
Is data mining independent subject?
No, it is interdisciplinary subject. includes, database technology, visualization, machine learning,
pattern recognition, algorithm etc.
What are different types of database?
Relational database, data warehouse and transactional database.
What are data mining functionality?
Mining frequent patterns, association rules, classification and prediction, clustering, evolution analysis
and outlier analysis.
What are issues in data mining?
Issues in mining methodology, performance issues, user interactive issues, different source of data
types issues etc.
List some applications of data mining.
Agriculture, biological data analysis, call record analysis, DSS, Business intelligence system etc
What do you mean by interesting pattern?
A pattern is said to be interesting if it is 1. easily understood by human 2. valid 3. potentially useful 4.
novel
Why do we pre-process the data?
To ensure the data quality [accuracy, completeness, consistency, timeliness, believability,
interpretability].
What are the steps involved in data pre-processing?
Data cleaning, data integration, data reduction, data transformation.
What is distributed data warehouse?
Distributed data warehouse shares data across multiple data repositories for the purpose of OLAP
operation.
Define virtual data warehouse.
A virtual data warehouse provides a compact view of the data inventory. It contains meta data and
uses middle-ware to establish connection between different data sources.
What are the different data warehouse models?
Enterprise data warehouse
Data marts
Virtual data warehouse
List few roles of data warehouse manager.
Creation of data marts, handling users, concurrency control, updates, etc.
What are the different types of cuboids?
The 0-D cuboid is called the apex cuboid.
The n-D cuboid is called the base cuboid.
The remaining cuboids are middle cuboids.
What are the forms of multidimensional model?
Star schema
Snow flake schema
Fact constellation Schema
What are frequent pattern?
A set of items that appear frequently together in a transaction data set.
eg milk, bread, sugar
What are the issues regarding classification and prediction?
Preparing data for classification and prediction
Comparing classification and prediction
Define model over fitting.
A model that fits training data well can have generalization errors. Such situation is called as model
over fitting.
What are the methods to remove model over fitting?
Pruning [Pre-pruning and post pruning)
Constraint in the size of decision tree
Making stopping criteria more flexible
What is regression?
Regression can be used to model the relationship between one or more independent and dependent
variables.
Linear regression and non-linear regression
Compare the K-means and K-medoids algorithms.
K-medoids is more robust than K-means in the presence of noise and outliers, but K-medoids can be
computationally costly.
What is K-nearest neighbor algorithm?
It is one of the lazy learner algorithm used in classification. It finds the k-nearest neighbor of the point
of interest.
What is Bayes’ Theorem?
P(H|X) = P(X|H) * P(H) / P(X)

What is concept Hierarchy?


It defines a sequence of mappings from a set of low-level concepts to higher-level, more general
concepts.
What are the causes of model over fitting?
Due to presence of noise
Due to lack of representative samples
Due to multiple comparison procedure
What is a decision tree classifier?
A decision tree is a hierarchically organised classifier which compares the data against a range of
properly selected features.
If there are n dimensions, how many cuboids are there?
There would be 2^n cuboids.
What is spatial data mining?
Spatial data mining is the process of discovering interesting, useful, non-trivial patterns from large
spatial datasets.
Spatial Data Mining = Mining Spatial Data Sets (i.e. Data Mining + Geographic
Information Systems)
What is multimedia data mining?
Multimedia Data Mining is a subfield of data mining that deals with an extraction of implicit
knowledge, multimedia data relationships, or other patterns not explicitly stored in multimedia
databases
What are different types of multimedia data?
image, video, audio
What is text mining?
Text mining is the procedure of synthesizing information by analyzing relations, patterns, and rules
among textual data. These procedures include text summarization, text categorization, and text
clustering.
List some application of text mining.
Customer profile analysis
patent analysis
Information dissemination
Company resource planning
What do you mean by web content mining?
Web content mining refers to the discovery of useful information from Web contents, including text,
images, audio, video, etc.
Define web structure mining and web usage mining.
Web structure mining studies the model underlying the link structures of the Web. It has been used
for search engine result ranking and other Web applications.
Web usage mining focuses on using data mining techniques to analyze search logs to find
interesting patterns. One of the main applications of Web usage mining is its use to learn user
profiles.
What are frequent patterns?
These are the patterns that appear frequently in a data set, e.g. item sets, subsequences, etc.

What is data characterization?

Data characterization is a summarization of the general features of a target class of data, for example,
analyzing the general features of the software products whose sales increased by 10%.
What is data discrimination?
Data discrimination is the comparison of the general features of the target class objects against the
general features of objects from one or more contrasting classes.
What can business analysts gain from having a data warehouse?
First, having a data warehouse may provide a competitive advantage by presenting relevant
information from which to measure performance and make critical adjustments in order to help win
over competitors.
Second, a data warehouse can enhance business productivity because it is able to quickly
and efficiently gather information that accurately describes the organization.
Third, a data warehouse facilitates customer relationship management because it provides a
consistent view of customers and items across all lines of business, all departments and all
markets.
Finally, a data warehouse may bring about cost reduction by tracking trends, patterns, and
exceptions over long periods in a consistent and reliable manner.
Why are association rules necessary?
In data mining, association rule learning is a popular and well researched method for discovering
interesting relations between variables in large databases.
It is intended to identify strong rules discovered in databases using different measures of interestingness.
What are two types of data mining tasks?
Descriptive task
Predictive task
Define classification.
Classification is the process of finding a model (or function) that describes and distinguishes data
classes or concepts.
What are outliers?
A database may contain data objects that do not comply with the general behavior or model of the
data. These data objects are called outliers.
What do you mean by evolution analysis?
Data evolution analysis describes and models regularities or trends for objects whose behavior change
over time.
Although this may include characterization, discrimination, association and
correlation analysis, classification, prediction, or clustering of time related data.
Distinct features of such as analysis include time-series data analysis, sequence or periodicity
pattern matching, and similarity-based data analysis.
Define KDD.
The process of finding useful information and patterns in data.
What are the components of data mining?
Database, data warehouse, World Wide Web, or other information repository
Database or data warehouse server
Knowledge base
Data mining engine
Pattern evaluation module
User interface
Define metadata.
A database that describes various aspects of data in the warehouse is called metadata.
What are the uses of metadata?
Map source system data to data warehouse tables
Generate data extract, transform, and load procedures for import jobs
Help users discover what data are in the data warehouse
Help users structure queries to access the data they need
List the demerits of a distributed data warehouse.
There is no metadata, no summary data and no individual DSS (Decision Support System)
integration or history. All queries must be repeated, causing an additional burden on the system.
Since queries compete with production data transactions, performance can be degraded.
There is no refreshing process, causing the queries to be very complex.
Define HOLAP.
The hybrid OLAP approach combines ROLAP and MOLAP technology.
What are data mining techniques?
Association rules
Classification and prediction
Clustering
Deviation detection
Similarity search
Sequence Mining
List different data mining tools.
Traditional data mining tools
Dashboards
Text mining tools
Define subsequence.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a (frequent)
sequential pattern if it occurs frequently in a shopping history database.

