DM Lab Manual
LAB MANUAL
R20
INDEX
S.No  Contents                                               Page No.
1     Institute Vision & Mission                             3
2     Department Vision & Mission                            3
3     Program Educational Objectives & Program Outcomes      4-5
4     Program Specific Outcomes                              6
5     Syllabus                                               7
6     Course Outcomes                                        8
7     List of Experiments                                    9
8     Course Outcomes of associated course                   10
9     Experiment Mapping with Course Outcomes                10
Experiments
1 Implement all basic R commands
2 Interact data through .csv files (Import from and export to .csv files).
3 Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that topic from swirl).
4 Visualize all Statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc., using Histograms, Boxplots and Scatter Plots).
5 Create a data frame with the following structure.
INSTITUTE MISSION
1. To incorporate benchmarked teaching and learning pedagogies in curriculum.
2. To ensure all round development of students through judicious blend of curricular, co-
curricular and extra-curricular activities.
3. To support cross-cultural exchange of knowledge between industry and academy.
4. To provide higher/continued education and research opportunities to the employees of
the institution.
DEPARTMENT VISION
To commit itself to continuously improve its educational environment in order to develop
graduates with the strong academic and technical backgrounds needed to achieve distinction
and discipline.
DEPARTMENT MISSION
To provide a strong theoretical and practical education in a congenial environment so as to
enable the students to fulfill their educational and industrial needs.
PROGRAM EDUCATIONAL OBJECTIVES OF IT DEPARTMENT
PEO 1:
Domain Knowledge: Have a strong foundation in areas like mathematics, science and engineering fundamentals so
as to enable them to solve and analyze engineering problems and to prepare them for careers, R&D and higher studies.
PEO 2:
Professional Employment: Have an ability to analyze and understand the requirements of software, technical
specifications required and provide novel engineering solutions to the problems associated with hardware and
software.
PEO 3:
Higher Degrees: Have exposure to cutting-edge technologies, thereby enabling them to achieve excellence in the areas
of their studies.
PEO 4:
Engineering Citizenship: Work in teams on multi-disciplinary projects with effective communication skills and
leadership qualities.
PEO 5:
Lifelong Learning: Have a successful career wherein they strike a balance between ethical values and commercial
values.
PROGRAM OUTCOMES (PO’S)
1. Engineering knowledge:
Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the
solution of complex engineering problems.
2. Problem analysis:
Identify, formulate, research literature, and analyze complex engineering problems reaching substantiated
conclusions using first principles of mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions:
Design solutions for complex engineering problems and design system components or processes that meet the
specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and
environmental considerations.
4. Conduct investigations of complex problems:
Use research-based knowledge and research methods including design of experiments, analysis and interpretation
of data, and synthesis of the information to provide valid conclusions.
6. The engineer and society:
Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues
and the consequent responsibilities relevant to the professional engineering practice.
8. Ethics:
Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering
practice.
9. Individual and team work:
Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
10. Communication:
Communicate effectively on complex engineering activities with the engineering community and with society at
large, such as, being able to comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
11. Project management and finance:
Demonstrate knowledge and understanding of the engineering and management principles and apply these to one’s
own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PROGRAM SPECIFIC OUTCOMES (PSO’S)
PSO 1: Able to develop business solutions through the latest software techniques and tools for real-time applications.
PSO 2: Able to practice the profession with ethical leadership as an entrepreneur through participation in various events like Ideathon, Hackathon, project expos and workshops.
PSO 3: Ability to identify the evolutionary changes in computing using Data Sciences, Apps, Cloud Computing and IoT.
III Year – II Semester L T P C
0 0 3 1.5
DATA MINING LAB
Course Objectives:
To understand the mathematical basics quickly and to cover the essential concepts of data mining in order to
prepare for real-world problems.
To cover the various classes of algorithms so as to give a foundation for applying this knowledge and diving
deeper into the different flavors of algorithms.
To make students aware of the packages and libraries of R and familiar with the functions used in R for
visualization.
To enable students to use R to conduct analytics on large real-life datasets.
To familiarize students with how various statistics like the mean and median can be computed, and how data can
be collected, for data exploration in R.
List of Experiments:
1. Implement all basic R commands.
2. Interact data through .csv files (Import from and export to .csv files).
3. Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that topic
from swirl).
4. Visualize all Statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc.,
using Histograms, Boxplots and Scatter Plots).
5. Create a data frame with the following structure.
6. Write R program using ‘apply’ group of functions to create and apply a normalization function on each of the
numeric variables/columns of the iris dataset to transform them into (i) a 0 to 1 range with min-max normalization
and (ii) a value around 0 with z-score normalization.
7. Create a data frame with 10 observations and 3 variables and add new rows and columns to it
using ‘rbind’ and ‘cbind’ function.
8. Write R program to implement linear and multiple regression on ‘mtcars’ dataset to estimate the
value of ‘mpg’ variable, with best R2 and plot the original values in ‘green’ and predicted values in
‘red’.
9. Implement k-means clustering using R.
10. Implement k-medoids clustering using R.
11. Implement density based clustering on iris dataset.
12. Implement decision trees using ‘readingSkills’ dataset.
13. Implement decision trees using ‘iris’ dataset using package party and ‘rpart’.
14. Use a Corpus() function to create a data corpus then Build a term Matrix and Reveal word
Frequencies.
Course Content (Associated Theory Course):
UNIT I
Data Warehousing, Business Analysis and On-Line Analytical Processing (OLAP): Basic Concepts,
Data Warehousing Components, Building a Data Warehouse, Database Architectures for Parallel
Processing, Parallel DBMS Vendors, Multidimensional Data Model, Data Warehouse Schemas for
Decision Support, Concept Hierarchies, Characteristics of OLAP Systems, Typical OLAP
Operations, OLAP and OLTP.
UNIT II
Data Mining – Introduction: Introduction to Data Mining Systems, Knowledge Discovery Process,
Data Mining Techniques, Issues, applications, Data Objects and attribute types, Statistical
description of data, Data Preprocessing – Cleaning, Integration, Reduction, Transformation and
discretization, Data Visualization, Data similarity and dissimilarity measures.
UNIT III
Data Mining - Frequent Pattern Analysis: Mining Frequent Patterns, Associations and Correlations,
Mining Methods, Pattern Evaluation Method, Pattern Mining in Multilevel, Multi-Dimensional
Space – Constraint Based Frequent Pattern Mining, Classification using Frequent Patterns
UNIT IV
UNIT V
Text Books:
1) Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, Third Edition,
Elsevier, 2012.
2) Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining,
Pearson,2016.
DEPARTMENT OF INFORMATION TECHNOLOGY
List of Experiments
S.No Contents
1 Implement all basic R commands
2 Interact data through .csv files (Import from and export to .csv files).
3 Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that topic from swirl).
4 Visualize all Statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc., using Histograms, Boxplots and Scatter Plots).
5 Create a data frame with the following structure.
Mapping of COs with Lab Experiments:
Experiment    Mapping Level
EX1 3
EX2 3
EX3 3
EX4 3
EX5 3
EX6 3
EX7 3
EX8 3
EX9 3
EX10 3
EX11 3
EX12 3
EX13 3
EX14 3
1. Implement all basic R commands.
VARIABLES:
> x<-2
>x
[1] 2
> y=5
>y
[1] 5
> 3<-z
Error in 3 <- z : invalid (do_set) left-hand side to assignment
>z
Error: object 'z' not found
> 3-> z
>z
[1] 3
> a<-b<-7
>a
[1] 7
>b
[1] 7
> assign("j",4)
>j
[1] 4
Removing Variable:
> rm(j)
>j
Error: object 'j' not found
> xyz<-5
> xyz
[1] 5
> XYZ
Error: object 'XYZ' not found
DATA TYPES:
> class(x)
[1] "numeric"
> is.numeric(x)
[1] TRUE
> i<-4L
>i
[1] 4
> is.integer(i)
[1] TRUE
> class(4L)
[1] "integer"
> class(2.8)
[1] "numeric"
> 4L*2.8
[1] 11.2
> 5L/2L
[1] 2.5
> class(5L/2L)
[1] "numeric"
> TRUE*5
[1] 5
> FALSE*5
[1] 0
Character Data:
>x<-data()
>x
> x<- "data"
>x
[1] "data"
> y<-factor("data")
>y
[1] data
Levels: data
> nchar(x)
[1] 4
> nchar("hello")
[1] 5
> nchar(3)
[1] 1
> nchar(452)
[1] 3
> nchar(y)
Error in nchar(y) : 'nchar()' requires a character vector
# Will not work for factor.
DATES:
> date1<-as.Date("2021-09-20")
> date1
[1] "2021-09-20"
> class(date1)
[1] "Date"
> as.numeric(date1)
[1] 18890
> date2<-as.POSIXct("2021-09-20")
> date2
[1] "2021-09-20 IST"
> class(date2)
[1] "POSIXct" "POSIXt"
LOGICAL:
> k<-TRUE
> class(k)
[1] "logical"
> 2==3
[1] FALSE
> #comments
> 2!=3
[1] TRUE
> 2<3
[1] TRUE
> 2>3
[1] FALSE
> "data"<"stats"
[1] TRUE
VECTORS:
> c(1,2,3,4)
[1] 1 2 3 4
>c
function (...) .Primitive("c")
> c("c","R","Python")
[1] "c" "R" "Python"
> x<-c(1,2,3,4)
>x
[1] 1 2 3 4
> x<-c(1,2,3,s)
Error: object 's' not found
> x<-c(1,2,3,3)
>x
[1] 1 2 3 3
> x+2
[1] 3 4 5 5
> x*2
[1] 2 4 6 6
> x/2
[1] 0.5 1.0 1.5 1.5
> sqrt(x)
[1] 1.000000 1.414214 1.732051 1.732051
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
> -2:5
[1] -2 -1 0 1 2 3 4 5
> 5:-9
[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9
> x<-1:10
>x
[1] 1 2 3 4 5 6 7 8 9 10
> y<- -5:4
>y
[1] -5 -4 -3 -2 -1 0 1 2 3 4
> x+y
[1] -4 -2 0 2 4 6 8 10 12 14
> x-y
[1] 6 6 6 6 6 6 6 6 6 6
> z=x-y
>z
[1] 6 6 6 6 6 6 6 6 6 6
> x/2
[1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> x/y
[1] -0.2 -0.5 -1.0 -2.0 -5.0 Inf 7.0 4.0 3.0 2.5
> x^2
[1] 1 4 9 16 25 36 49 64 81 100
> length(x)
[1] 10
> length(x+y)
[1] 10
>x
[1] 1 2 3 4 5 6 7 8 9 10
> x+c(1,2)
[1] 2 4 4 6 6 8 8 10 10 12
> x<=5
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
> x<y
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> x<- 10:1
> y<- -4:5
>x
[1] 10 9 8 7 6 5 4 3 2 1
>y
[1] -4 -3 -2 -1 0 1 2 3 4 5
> any(x<y)
[1] TRUE
> all(x<y)
[1] FALSE
> q<-
c("hockey","football","baseball","curling","rugby","lacrosse","basketball","tennis","cricket","soccer")
> nchar(q)
[1] 6 8 8 7 5 8 10 6 7 6
> nchar(y)
[1] 2 2 2 2 1 1 1 1 1 1
>x
[1] 10 9 8 7 6 5 4 3 2 1
> x[1]
[1] 10
> x[1:2]
[1] 10 9
> x{c(1,5)}
Error: unexpected '{' in "x{"
> x[c(1,5)]
[1] 10 6
> c(one="a",two="b",three="c")
one two three
"a" "b" "c"
> w<-1:3
> names(w)
NULL
> names(w)<-c("a","b","c")
>w
a b c 
1 2 3 
CALLING A FUNCTION:
>x
[1] 10 9 8 7 6 5 4 3 2 1
> mean(x)
[1] 5.5
> mode(x)
[1] "numeric"
> median(x)
[1] 5.5
FUNCTION DOCUMENTATION:
> apropos("mea")
[1] ".colMeans" ".rowMeans" "colMeans"
[4] "influence.measures" "kmeans" "mean"
[7] "mean.Date" "mean.default" "mean.difftime"
[10] "mean.POSIXct" "mean.POSIXlt" "rowMeans"
[13] "weighted.mean"
> ?'+'
Missing Data: NA
> z<-c(1,2,NA,8,3,NA,3)
>z
[1] 1 2 NA 8 3 NA 3
> is.na(z)
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE
NULL:
> z<-c(1,NULL,3)
>z
[1] 1 3
> d<-NULL
> is.null(d)
[1] TRUE
Data Frames:
Data frame is just like an Excel spreadsheet in that it has column and rows. In statistical terms, each
column is a variable and each row is an observation.
> x<- 10:1
> y<--4:3
>x
[1] 10 9 8 7 6 5 4 3 2 1
>y
[1] -4 -3 -2 -1 0 1 2 3
> y<--4:5
>y
[1] -4 -3 -2 -1 0 1 2 3 4 5
> q<-
c("hockey","football","baseball","curling","rugby","lacrosse","basketball","tennis","cricket","soccer")
[1] "hockey" "football" "baseball" "curling" "rugby" "lacrosse"
[7] "basketball" "tennis" "cricket" "soccer"
> theDF<-data.frame(x,y,q)
> theDF
x y q
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
> theDF<-data.frame(First=x, Second=y,Third=q)
> theDF
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
> nrow(theDF)
[1] 10
> NCOL(theDF)
[1] 3
> dim.data.frame(theDF)
[1] 10 3
> dim(theDF)
[1] 10 3
> names(theDF)
[1] "First" "Second" "Third"
> names(theDF) [3]
[1] "Third"
> rownames(theDF)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
> rownames(theDF)<-c("one", "two","threee","four","five","six","seven","eight","nine","ten")
> row.names(theDF)
[1] "one" "two" "threee" "four" "five" "six" "seven" "eight"
[9] "nine" "ten"
> theDF
First Second Third
one 10 -4 hockey
two 9 -3 football
threee 8 -2 baseball
four 7 -1 curling
five 6 0 rugby
six 5 1 lacrosse
seven 4 2 basketball
eight 3 3 tennis
nine 2 4 cricket
ten 1 5 soccer
> rownames(theDF) <-NULL
> rownames(theDF)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
> head(theDF)
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
> head(theDF, n=7)
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
> tail(theDF)
First Second Third
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
> class(theDF)
[1] "data.frame"
> theDF
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
> theDF[3,2] #Third row, Second Column element
[1] -2
> theDF[3,2:3] # row 3, columns 2 through 3
Second Third
3 -2 baseball
> theDF[c(3,5),2] #rows 3 and 5, column 2
[1] -2 0
> theDF[c(3,5),2:3] # rows 3 and 5, column 2 through 3
Second Third
3 -2 baseball
5 0 rugby
> theDF$Third #only Third Column
[1] "hockey" "football" "baseball" "curling" "rugby" "lacrosse"
[7] "basketball" "tennis" "cricket" "soccer"
> theDF[,3]
[1] "hockey" "football" "baseball" "curling" "rugby" "lacrosse"
[7] "basketball" "tennis" "cricket" "soccer"
> theDF[,2:3] #column 2 through 3
Second Third
1 -4 hockey
2 -3 football
3 -2 baseball
4 -1 curling
5 0 rugby
6 1 lacrosse
7 2 basketball
8 3 tennis
9 4 cricket
10 5 soccer
> theDF[2,] #2 nd row
First Second Third
2 9 -3 football
> theDF[2:4,] # row 2 through 4
First Second Third
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
> theDF[,c("First","Third")] #access multiple column by name
First Third
1 10 hockey
2 9 football
3 8 baseball
4 7 curling
5 6 rugby
6 5 lacrosse
7 4 basketball
8 3 tennis
9 2 cricket
10 1 soccer
LISTS:
> list(1,2,3) # creates a three element list
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

> list(c(1,2,3)) # creates a single element list where the only element is a vector that has 3 elements
[[1]]
[1] 1 2 3

> list(c(1,2,3), 3:7) # creates a two element list; the second element is a 5 element vector
[[1]]
[1] 1 2 3

[[2]]
[1] 3 4 5 6 7
#two element list , first element is a data.frame, second element is a 10 element vector
> list(theDF, 1:10)
[[1]]
First Second Third
1 10 -4 hockey
2 9 -3 football
3 8 -2 baseball
4 7 -1 curling
5 6 0 rugby
6 5 1 lacrosse
7 4 2 basketball
8 3 3 tennis
9 2 4 cricket
10 1 5 soccer
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10
# a three element list: the data frame (its output prints as shown above), a vector and a nested list
> list(theDF, 1:10, list(c(1,2,3), 3:7))
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10

[[3]]
[[3]][[1]]
[1] 1 2 3

[[3]][[2]]
[1] 3 4 5 6 7
2) Interact data through .csv files (Import from and export to .csv files).
In R, we can read data from files stored outside the R environment. We can also write data into files
which will be stored and accessed by the operating system. R can read and write into various file formats
like csv, excel, xml etc.
Getting and Setting the Working Directory
You can check which directory the R workspace is pointing to using the getwd() function. You can also
set a new working directory using the setwd() function.
> print(getwd())
[1] "C:/Users/Prasanna Kumar/Documents"
> data <- read.csv("2.csv") # Reading the Data from the .csv file
> print(data) # data
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
Analyzing the CSV File
> print(ncol(data)) # number of columns
[1] 5
> print(nrow(data)) # number of rows
[1] 8
Get the maximum salary
>sal <- max(data$salary)
>print(sal)
[1] 843.25
Get all the people working in IT department
> retval <- subset( data, dept == "IT")
> retval
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
6 6 Nina 578.0 2013-05-21 IT
> info <- subset(data, salary > 600 & dept == "IT") # IT dept with salary>600
> print(info)
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
> retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
> retval
id name salary start_date dept
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
8 8 Guru 722.50 2014-06-17 Finance
Writing into a CSV File
R can create a csv file from an existing data frame. The write.csv() function is used to create the csv file. This
file gets created in the working directory.
> write.csv(retval,"output.csv")
> newdata <- read.csv("output.csv")
> print(newdata)
X id name salary start_date dept
1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 5 Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
3. Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that
topic from swirl).
swirl is a software package for the R programming language that turns the R console into an interactive
learning environment. Users receive immediate feedback as they are guided through self-paced lessons
in data science and R programming.
The swirl R package makes it fun and easy to learn R programming and data science.
Step 1: Get R
In order to run swirl, you must have R 3.1.0 or later installed on your computer.
Step 2 (recommended): Get RStudio
In addition to R, it’s highly recommended that you install RStudio, which will make your experience with
R much more enjoyable.
Step 3: Install swirl
Open RStudio (or just plain R if you don't have RStudio) and type the following into the console:
> install.packages("swirl")
Note that the > symbol at the beginning of the line is R's prompt for you to type something into the console.
Step 4: Start swirl
This is the only step that you will repeat every time you want to run swirl. First, you will load the package
using the library() function. Then you will call the function that starts the magic! Type the following,
pressing Enter after each line:
> library("swirl")
> swirl()
Step 5: Install an interactive course
The first time you start swirl, you'll be prompted to install a course. You can either install one of the
recommended courses or visit the course repository for more options. There are even more courses available
from the Swirl Course Network.
If you'd like to install a course that is not part of our course repository, type ?InstallCourses at the R
prompt for a list of functions that will help to do so.
> library(swirl)
> swirl()
What shall I call you? Prasanna Kumar
... <-- That's your cue to press Enter to continue
Select 1, 2, or 3 and press Enter
1: Continue.
2: Proceed.
3: Let's get going!
Selection: 1
...
Selection: 1
Selection: 1
| Attempting to load lesson dependencies...
| This lesson requires the ‘dplyr’ package. Would you like me to install it for you
| now?
1: Yes
2: No
Selection: 1
package ‘purrr’ successfully unpacked and MD5 sums checked
package ‘generics’ successfully unpacked and MD5 sums checked
package ‘tidyselect’ successfully unpacked and MD5 sums checked
package ‘dplyr’ successfully unpacked and MD5 sums checked
> dim(mydf)
[1] 225468 11
| Great job!
|====== | 8%
| Now use head() to preview the data.
> head(mydf)
X date time size r_version r_arch r_os package version
1 1 2014-07-08 00:54:41 80589 3.1.0 x86_64 mingw32 htmltools 0.2.4
2 2 2014-07-08 00:59:53 321767 3.1.0 x86_64 mingw32 tseries 0.10-32
3 3 2014-07-08 00:47:13 748063 3.1.0 x86_64 linux-gnu party 1.0-15
4 4 2014-07-08 00:48:05 606104 3.1.0 x86_64 linux-gnu Hmisc 3.14-4
5 5 2014-07-08 00:46:50 79825 3.0.2 x86_64 linux-gnu digest 0.6.4
6 6 2014-07-08 00:48:04 77681 3.1.0 x86_64 linux-gnu randomForest 4.6-7
country ip_id
1 US 1
2 US 2
3 US 3
4 US 3
5 CA 4
6 US 3
> library(dplyr)
| Your dedication is inspiring!
> packageVersion("dplyr")
[1] ‘1.0.7’
...
|=========== | 15%
| The first step of working with data in dplyr is to load the data into what the
| package authors call a 'data frame tbl' or 'tbl_df'. Use the following code to
| create a new tbl_df called cran:
|
| cran <- tbl_df(mydf).
| Nice work!
|============ | 17%
| To avoid confusion and keep things running smoothly, let's remove the original
| data frame from your workspace with rm("mydf").
> rm("mydf")
> cran
# A tibble: 225,468 x 11
X date time size r_version r_arch r_os package version country ip_id
<int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 1 2014-0~ 00:54~ 8.06e4 3.1.0 x86_64 ming~ htmltoo~ 0.2.4 US 1
2 2 2014-0~ 00:59~ 3.22e5 3.1.0 x86_64 ming~ tseries 0.10-32 US 2
3 3 2014-0~ 00:47~ 7.48e5 3.1.0 x86_64 linu~ party 1.0-15 US 3
4 4 2014-0~ 00:48~ 6.06e5 3.1.0 x86_64 linu~ Hmisc 3.14-4 US 3
5 5 2014-0~ 00:46~ 7.98e4 3.0.2 x86_64 linu~ digest 0.6.4 CA 4
6 6 2014-0~ 00:48~ 7.77e4 3.1.0 x86_64 linu~ randomF~ 4.6-7 US 3
7 7 2014-0~ 00:48~ 3.94e5 3.1.0 x86_64 linu~ plyr 1.8.1 US 3
8 8 2014-0~ 00:47~ 2.82e4 3.0.2 x86_64 linu~ whisker 0.3-2 US 5
9 9 2014-0~ 00:54~ 5.93e3 NA NA NA Rcpp 0.10.4 CN 6
10 10 2014-0~ 00:15~ 2.21e6 3.0.2 x86_64 linu~ hflights 0.1 US 7
# .... with 225,458 more rows
...
|================ | 22%
| First, we are shown the class and dimensions of the dataset. Just below that, we
| get a preview of the data. Instead of attempting to print the entire dataset,
| dplyr just shows us the first 10 rows of data and only as many columns as fit
| neatly in our console. At the bottom, we see the names and classes for any
| variables that didn't fit on our screen.
...
|================= | 23%
| According to the "Introduction to dplyr" vignette written by the package authors,
| "The dplyr philosophy is to have small functions that each do one thing well."
| Specifically, dplyr supplies five 'verbs' that cover most fundamental data
| manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().
...
|================== | 25%
| Use ?select to pull up the documentation for the first of these core functions.
> ?select
|==================== | 27%
| Help files for the other functions are accessible in the same way.
...
|===================== | 28%
| As may often be the case, particularly with larger datasets, we are only
| interested in some of the variables. Use select(cran, ip_id, package, country) to
| select only the ip_id, package, and country variables from the cran dataset.
| That's correct!
|====================== | 30%
| The first thing to notice is that we don't have to type cran$ip_id, cran$package,
| and cran$country, as we normally would when referring to columns of a data frame.
| The select() function knows we are referring to columns of the cran dataset.
...
|======================= | 32%
| Also, note that the columns are returned to us in the order we specified, even
| though ip_id is the rightmost column in the original dataset.
...
|========================= | 33%
| Recall that in R, the `:` operator provides a compact notation for creating a
| sequence of numbers. For example, try 5:20.
4) Visualize all Statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc., using
Histograms, Boxplots and Scatter Plots).
MEAN
The mean of an observation variable is a numerical measure of the central location of the data values. It is
the sum of its data values divided by data count.
Hence, for a data sample of size n, its sample mean is defined as follows:
Problem
Find the mean eruption duration in the data set faithful.
>head(faithful)
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
Solution
We apply the mean function to compute the mean value of eruptions.
> duration = faithful$eruptions # the eruption durations
> mean(duration) # apply the mean function
[1] 3.4878
Answer
The mean eruption duration is 3.4878 minutes.
MEDIAN
The median of an observation variable is the value at the middle when the data is sorted in ascending order.
It is an ordinal measure of the central location of the data values.
Problem
Find the median of the eruption duration in the data set faithful.
Solution
We apply the median function to compute the median value of eruptions.
> duration = faithful$eruptions # the eruption durations
> median(duration) # apply the median function
[1] 4
Answer
The median of the eruption duration is 4 minutes.
MODE
It is the value that has the highest frequency in the given data set. The data set may have no mode if the
frequency of all data points is the same. Also, we can have more than one mode if we encounter two or
more data points having the same frequency. There is no inbuilt function for finding the mode in R, so we can
create our own function for finding the mode or we can use the package called ‘modeest’.
> mode <- function(){
+   tab <- table(faithful$eruptions)   # frequency of each eruption duration
+   tab[which.max(tab)]                # the most frequent value, with its count
+ }
> mode()
1.867 
    8 
QUARTILE
There are several quartiles of an observation variable. The first quartile, or lower quartile, is the value that
cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the
value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.
Problem
Find the quartiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the quartiles of eruptions.
> duration = faithful$eruptions # the eruption durations
> quantile(duration) # apply the quantile function
0% 25% 50% 75% 100%
1.6000 2.1627 4.0000 4.4543 5.1000
Answer
The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and 4.4543 minutes
respectively.
PERCENTILE
The nth percentile of an observation variable is the value that cuts off the first n percent of the data values
when it is sorted in ascending order.
Problem
Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles of eruptions with the desired percentage ratios.
> duration = faithful$eruptions # the eruption durations
> quantile(duration, c(.32, .57, .98))
32% 57% 98%
2.3952 4.1330 4.9330
Answer
The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330 minutes
respectively.
RANGE
The range of an observation variable is the difference of its largest and smallest data values. It is a measure
of how far apart the entire data spreads in value.
Problem
Find the range of the eruption duration in the data set faithful.
Solution
We apply the max and min function to compute the largest and smallest values of eruptions, then take the
difference.
> duration = faithful$eruptions # the eruption durations
> max(duration) - min(duration) # apply the max and min functions
[1] 3.5
Answer
The range of the eruption duration is 3.5 minutes.
INTERQUARTILE RANGE
The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a
measure of how far apart the middle portion of data spreads in value.
Problem
Find the interquartile range of eruption duration in the data set faithful.
Solution
We apply the IQR function to compute the interquartile range of eruptions.
> duration = faithful$eruptions # the eruption durations
> IQR(duration) # apply the IQR function
[1] 2.2915
Answer
The interquartile range of eruption duration is 2.2915 minutes.
BOX PLOT
The box plot of an observation variable is a graphical representation based on its quartiles, as well as its
smallest and largest values. It attempts to provide a visual shape of the data distribution.
Problem
Find the box plot of the eruption duration in the data set faithful.
Solution
We apply the boxplot function to produce the box plot of eruptions.
> duration = faithful$eruptions # the eruption durations
> boxplot(duration, horizontal=TRUE) # horizontal box plot
Answer
The box plot of the eruption duration is:
HISTOGRAM
A histogram consists of parallel vertical bars that graphically shows the frequency distribution of a
quantitative variable. The area of each bar is equal to the frequency of items found in each class.
Example
In the data set faithful, the histogram of the eruptions variable is a collection of parallel vertical bars
showing the number of eruptions classified according to their durations.
Problem
Find the histogram of the eruption durations in faithful.
Solution
We apply the hist function to produce the histogram of the eruptions variable.
> duration = faithful$eruptions
> hist(duration, right=FALSE) # apply the hist function; intervals closed on the left
Answer
The histogram of the eruption durations is:
SCATTER PLOT
A scatter plot pairs up values of two quantitative variables in a data set and display them as geometric points
inside a Cartesian diagram.
Example
In the data set faithful, we pair up the eruptions and waiting values in the same observation as (x,y)
coordinates. Then we plot the points in the Cartesian plane. Here is a preview of the eruption data value
pairs with the help of the cbind function.
Enhanced Solution
We can generate a linear regression model of the two variables with the lm function, and then draw a trend
line with abline.
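A minimal sketch of the preview, the scatter plot and the trend line described above (the name waiting is introduced here for the waiting-time column):
> duration = faithful$eruptions          # the eruption durations
> waiting = faithful$waiting             # the waiting times between eruptions
> head(cbind(duration, waiting))         # preview of the (x, y) data value pairs
> plot(duration, waiting,                # scatter plot of the pairs
+      xlab="Eruption duration", ylab="Time waited")
> fit = lm(waiting ~ duration)           # linear regression model of the two variables
> abline(fit)                            # draw the trend line on the scatter plot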
5) Create a data frame with the following structure.
EMP ID EMP NAME SALARY START DATE
1 Satish 5000 01-11-2013
2 Vani 7500 05-06-2011
3 Ramesh 10000 21-09-1999
4 Praveen 9500 13-09-2005
5 Pallavi 4500 23-10-2000
> emp_id<-1:5
> emp_name<-c("Satish","Vani","Ramesh","Praveen","Pallavi")
> Salary<-c(5000,7500,10000,9500,4500)
> d1<-as.Date("01-11-2013", format="%d-%m-%Y")
> d2<-as.Date("05-06-2011", format="%d-%m-%Y")
> d3<-as.Date("21-09-1999", format="%d-%m-%Y")
> d4<-as.Date("13-09-2005", format="%d-%m-%Y")
> d5<-as.Date("23-10-2000", format="%d-%m-%Y")
> Start_Date<-c(d1,d2,d3,d4,d5)
> theDF<-data.frame(emp_id,emp_name,Salary,Start_Date)
> theDF
emp_id emp_name Salary Start_Date
1 1 Satish 5000 2013-11-01
2 2 Vani 7500 2011-06-05
3 3 Ramesh 10000 1999-09-21
4 4 Praveen 9500 2005-09-13
5 5 Pallavi 4500 2000-10-23
c. Extract 3rd and 5th row with 2nd and 4th column.
> theDF[c(3,5),c(2,4)]
emp_name Start_Date
3 Ramesh 1999-09-21
5 Pallavi 2000-10-23
6) Write R Program using ‘apply’ group of functions to create and apply normalization function on
each of the numeric variables/columns of iris dataset to transform them into
i. 0 to 1 range with min-max normalization.
ii. a value around 0 with z-score normalization.
Min-Max Normalization
(X – min(X))/(max(X) – min(X))
For each value of a variable, we simply find how far that value is from the minimum value, then divide
by the range.
To implement this in R, we can define a simple function and then use lapply to apply that function to
whichever columns in the iris dataset we would like:
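A minimal sketch of such a function applied with lapply (the names min_max_norm and iris_norm are illustrative, not from the original):
#define a min-max normalization function
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
#apply the function to the four numeric columns of iris
iris_norm <- as.data.frame(lapply(iris[1:4], min_max_norm))
head(iris_norm)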
Notice that each of the columns now has values that range from 0 to 1. Also notice that the fifth column
“Species” was dropped from this data frame. We can easily add it back by using the following code:
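A sketch, assuming the iris_norm data frame from the previous step:
#add the Species column back to the normalized data frame
iris_norm$Species <- iris$Species
head(iris_norm)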
The drawback of the min-max normalization technique is that it tends to bring the data values towards the
mean. If we want to make sure that outliers get weighted more than other values, a z-score standardization
is a better technique to implement.
Z-Score Standardization
(X – μ) / σ
For each value of a variable, we simply subtract the mean value of the variable, then divide by the
standard deviation of the variable.
If we simply want to standardize one variable in a dataset, such as Sepal.Width in the iris dataset, we can
use the following code:
#standardize Sepal.Width
iris$Sepal.Width <- (iris$Sepal.Width - mean(iris$Sepal.Width)) / sd(iris$Sepal.Width)
head(iris)
The values of Sepal.Width are now scaled such that the mean is 0 and the standard deviation is 1. We can
even verify this if we’d like:
sd(iris$Sepal.Width)
#[1] 1
To standardize several variables, we can simply use the scale function. For example, the following code
shows how to scale the first four columns of the iris dataset:
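A sketch using the built-in scale function (iris_standardize is an illustrative name):
#standardize the first four (numeric) columns of the iris dataset
iris_standardize <- as.data.frame(scale(iris[1:4]))
head(iris_standardize)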
7) Create a data frame with 10 observations and 3 variables and add new rows and columns to it
using ‘rbind’ and ‘cbind’ function.
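The transcript below starts at the rbind step; a sketch of the earlier steps that produce cb_df1, reconstructed from the printed rows (df1 is an assumed intermediate name):
> name <- c("Rahul","joe","Adam","Brendon","Srilakshmi","Prasanna Kumar","Anitha","Bhanu","Rajesh","Priya")
> married_year <- c(2016,2015,2016,2008,2007,2009,2011,2013,2014,2008)
> Salary <- c(10000,15000,12000,13000,14000,15000,12000,10000,11000,14000)
> df1 <- data.frame(name, married_year, Salary)   # 10 observations and 3 variables
> father_name <- c("Gandhi","Jashua","God","Bush","Venkateswarlu","David","Anand","Bharath","Rupesh","Prem Sagar")
> cb_df1 <- cbind(df1, father_name)               # cbind to add a new column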
> rb<-c("Prakash",2011,15000,"Jeevan")
> rb
[1] "Prakash" "2011" "15000" "Jeevan"
> rb_df1<-rbind(cb_df1,rb) # rbind to add new row
> rb_df1
name married_year Salary father_name
1 Rahul 2016 10000 Gandhi
2 joe 2015 15000 Jashua
3 Adam 2016 12000 God
4 Brendon 2008 13000 Bush
5 Srilakshmi 2007 14000 Venkateswarlu
6 Prasanna Kumar 2009 15000 David
7 Anitha 2011 12000 Anand
8 Bhanu 2013 10000 Bharath
9 Rajesh 2014 11000 Rupesh
10 Priya 2008 14000 Prem Sagar
11 Prakash 2011 15000 Jeevan
8) Write R program to implement linear and multiple regression on ‘mtcars’ dataset to estimate the
value of ‘mpg’ variable, with best R2 and plot the original values in ‘green’ and predicted values in
‘red’.
Name Description
mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (lb/1000)
qsec 1/4 mile time
vs V/S
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburettors
If we are interested in the relationship between fuel efficiency (mpg) and weight (wt) we may start
plotting those variables with:
>plot(mpg ~ wt, data = mtcars, col=2)
The plot shows a (linear) relationship. Then, if we want to perform linear regression to determine the
coefficients of a linear model, we would use the lm function:
> fit <- lm(mpg ~ wt, data = mtcars)
The ~ here means "explained by", so the formula mpg ~ wt means we are predicting mpg as explained by
wt. The most helpful way to view the output is with:
>summary(fit)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The summary output shows:
The estimated value of each coefficient (the y-intercept and wt), which suggests the best-fit
prediction of mpg is 37.2851 + (-5.3445) * wt.
The p-value of each coefficient, which suggests that the intercept and weight are probably not due
to chance.
Overall estimates of fit such as R^2 and adjusted R^2, which show how much of the variation in
mpg is explained by the model.
We could add a line to our first plot to show the predicted mpg:
abline(fit,col=3,lwd=2)
It is also possible to add the equation to that plot. First, get the coefficients with coef. Then, using paste0,
we collapse the coefficients with the appropriate variables and +/- signs to build the equation. Finally, we add
it to the plot using mtext:
bs <- round(coef(fit), 3)
lmlab <- paste0("mpg = ", bs[1],
ifelse(sign(bs[2])==1, " + ", " - "), abs(bs[2]), " wt ")
mtext(lmlab, 3, line=-2)
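The experiment also asks for the original values in green and the predicted values in red; a sketch, assuming the fitted model fit from above:
> pred <- predict(fit)                            # predicted mpg values from the fitted model
> plot(mtcars$wt, mtcars$mpg, col="green", pch=16,
+      xlab="wt", ylab="mpg")                     # original values in green
> points(mtcars$wt, pred, col="red", pch=16)      # predicted values in red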
In multiple regression there is more than one predictor variable and one response variable; the relation of the
variables is:
Y = a + b1*x1 + b2*x2 + ... + bn*xn
where Y is the response, a, b1, ..., bn are the coefficients and x1, ..., xn are the predictor variables.
For this tutorial on Multiple Regression Analysis using R Programming, I am going to use mtcars dataset
and we will see How the Model is built for two and three predictor variables.
Case Study 1: Establishing Relationship between “mpg” as response variable and “disp”, “hp” as predictor
variables.
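The summary output referred to below is assumed to come from a fit like the following (model is the name later passed to plot(model)):
> model <- lm(mpg ~ disp + hp, data = mtcars)   # mpg explained by two predictors
> summary(model)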
As you can see in the summary output shown above, we get the intercept value (the value of ‘a’ in the
equation), and the coefficients of “disp” and “hp” are -0.030346 and -0.024840 respectively. Therefore
the regression analysis equation will be:
mpg = a - 0.030346*disp - 0.024840*hp
Using the above equation we can predict the value of mpg based on disp and hp.
>plot(model)
Output:
Case Study 2: Establishing Relationship between “mpg” as response variable and “disp”, “hp” and “wt”
as predictor variables.
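A sketch of this three-predictor fit, following the same pattern as Case Study 1:
> model <- lm(mpg ~ disp + hp + wt, data = mtcars)   # mpg explained by three predictors
> summary(model)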
9) Implement k-means clustering using R.
K Means is a clustering algorithm that repeatedly assigns a group amongst k groups present to a data point
according to the features of the point. It is a centroid-based clustering method.
Step 1
The iris dataset has 5 columns, namely Sepal Length, Sepal Width, Petal Length, Petal Width, and Species. Iris
is a flower, and in this dataset 3 of its species, Setosa, Versicolor and Virginica, are included. We will
cluster the flowers according to their species. The code to load the dataset:
>data("iris")
>head(iris) #will show top 6 rows only
Step 2
The next step is to separate the 3rd and 4th columns into a separate object x, as we are using an unsupervised
learning method. We remove the labels so that only the petal length and petal width columns
will be used by the machine to perform clustering, unsupervised.
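A sketch of this step (x is the name used in Step 3):
> x <- iris[, 3:4]   # keep only Petal.Length and Petal.Width (columns 3 and 4)
> head(x)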
Step 3
The next step is to use the K Means algorithm. K Means is the method we use which has parameters (data,
no. of clusters or groups). Here our data is the x object and we will have k=3 clusters as there are 3 species
in the dataset.
Then the ‘cluster’ package is called. Clustering in R is done using this inbuilt package which will perform
all the mathematics. Clusplot function creates a 2D graph of the clusters.
>model=kmeans(x,3)
>library(cluster)
>clusplot(x,model$cluster)
Component 1 and Component 2 seen in the graph are the two components in PCA (Principal Component
Analysis) which is basically a feature extraction method that uses the important components and removes
the rest. It reduces the dimensionality of the data for easier KMeans application. All of this is done by the
cluster package itself in R.
Step 4
The next step is to assign different colors to the clusters and shade them, so we use the color and shade
parameters, setting them to T (TRUE).
>clusplot(x,model$cluster,color=T,shade=T)
10) Implement k-medoids clustering using R.
K-medoids clustering is a technique in which we place each observation in a dataset into one of K clusters.
The end goal is to have K clusters in which the observations within each cluster are quite similar to each
other while the observations in different clusters are quite different from each other.
In practice, we use the following steps to perform K-medoids clustering:
1. Choose a value for K.
First, we must decide how many clusters we’d like to identify in the data. Often we have to simply
test several different values for K and analyze the results to see which number of clusters seems to
make the most sense for a given problem.
2. Randomly assign each observation to an initial cluster, from 1 to K.
3. Perform the following procedure until the cluster assignments stop changing.
For each of the K clusters, compute the cluster centroid. This is the vector of the p feature medians
for the observations in the kth cluster.
Assign each observation to the cluster whose centroid is closest. Here, closest is defined using
Euclidean distance.
K-Medoids Clustering in R
Step 1: Load the Necessary Packages
First, we’ll load two packages that contain several useful functions for k-medoids clustering in R.
>library(factoextra)
>library(cluster)
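Step 2: Prepare the Data. A minimal sketch, assuming the USArrests dataset used in Step 4, with each variable scaled to mean 0 and standard deviation 1:
#load the data and remove rows with missing values
df <- USArrests
df <- na.omit(df)
#scale each variable to have mean 0 and standard deviation 1
df <- scale(df)
head(df)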
Step 3: Find the Optimal Number of Clusters
1. Number of Clusters vs. the Total Within Sum of Squares
The total within sum of squares will typically always increase as we increase the number of clusters, so
when we create this type of plot we look for an “elbow” where the sum of squares begins to “bend” or level
off.
The point where the plot bends is typically the optimal number of clusters. Beyond this number, overfitting
is likely to occur.
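The elbow plot described above can be produced with the fviz_nbclust() function from factoextra; a sketch, assuming the prepared df from Step 2:
#plot number of clusters vs. total within sum of squares
fviz_nbclust(df, pam, method = "wss")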
For this plot it appears that there is a bit of an elbow or “bend” at k = 4 clusters.
2. Number of Clusters vs. Gap Statistic
Another way to determine the optimal number of clusters is to use a metric known as the gap statistic,
which compares the total intra-cluster variation for different values of k with their expected values for a
distribution with no clustering.
We can calculate the gap statistic for each number of clusters using the clusGap() function from the cluster
package along with a plot of clusters vs. gap statistic using the fviz_gap_stat() function:
#calculate gap statistic based on number of clusters
gap_stat <- clusGap(df,
                    FUN = pam,
                    K.max = 10, #max clusters to consider
                    B = 50)     #total bootstrapped iterations
#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)
From the plot we can see that gap statistic is highest at k = 4 clusters, which matches the elbow method we
used earlier.
Step 4: Perform K-Medoids Clustering with Optimal K
Lastly, we can perform k-medoids clustering on the dataset using the optimal value for k of 4:
#make this example reproducible
set.seed(1)
#perform k-medoids clustering with k = 4 clusters
kmed <- pam(df, k = 4)
#view results
kmed
Available components:
[1] "medoids" "id.med" "clustering" "objective" "isolation"
[6] "clusinfo" "silinfo" "diss" "call" "data"
Note that the four cluster centroids are actual observations in the dataset. Near the top of the output we can
see that the four centroids are the following states:
Alabama
Michigan
Oklahoma
New Hampshire
We can visualize the clusters on a scatterplot that displays the first two principal components on the axes
using the fviz_cluster() function:
#plot results of final k-medoids model
fviz_cluster(kmed, data = df)
We can also append the cluster assignments of each state back to the original dataset:
#add cluster assignment to original data
final_data <- cbind(USArrests, cluster = kmed$cluster)
#view final data
head(final_data)
11) Implement density based clustering on the iris dataset.
The iris dataset consists of 50 samples from each of 3 species of Iris (Iris setosa, Iris virginica, Iris versicolor)
and is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper
The use of multiple measurements in taxonomic problems. Four features were measured from each sample
i.e length and width of the sepals and petals and based on the combination of these four features, Fisher
developed a linear discriminant model to distinguish the species from each other.
# Loading data
>data(iris)
# Structure
>str(iris)
Using the DBScan clustering algorithm on the iris dataset (with the species label removed, leaving the four
numeric attributes):
# Installing Packages
>install.packages("fpc")
# Loading package
>library(fpc)
# Remove label from dataset
>iris_1 <- iris[-5]
# Fitting DBScan clustering Model
# to training dataset
>set.seed(220) # Setting seed
>Dbscan_cl <- dbscan(iris_1, eps = 0.45, MinPts = 5)
>Dbscan_cl
# Checking cluster
>Dbscan_cl$cluster
# Table
>table(Dbscan_cl$cluster, iris$Species)
# Plotting Cluster
>plot(Dbscan_cl, iris_1, main = "DBScan")
12) Implement decision trees using the ‘readingSkills’ dataset.
Format
A data frame with 200 observations on the following 4 variables.
nativeSpeaker: a factor with levels no and yes, where yes indicates that the child is a native
speaker of the language of the reading test.
age : age of the child in years.
shoeSize: shoe size of the child in cm.
score: raw score on the reading test.
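Steps 1 and 2: Load the required packages and the dataset. A minimal sketch, assuming the party and caTools packages:
> library(party)      # provides ctree() and the readingSkills dataset
> library(caTools)    # provides sample.split()
> data("readingSkills")
> head(readingSkills)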
Step 3: Splitting dataset into 4:1 ratio for train and test data
>sample_data = sample.split(readingSkills, SplitRatio = 0.8)
## sample_data<- sample(2,nrow(readingSkills), replace=TRUE, prob=c(0.8,0.2))
>train_data <- subset(readingSkills, sample_data == TRUE)
>test_data <- subset(readingSkills, sample_data == FALSE)
Step 4: Create the decision tree model using ctree and plot the model
>model<- ctree(nativeSpeaker ~ ., train_data)
>plot(model)
The basic syntax for creating a decision tree in R is:
>ctree(formula, data)
where, formula describes the predictor and response variables and data is the data set used. In this case
nativeSpeaker is the response variable and the other predictor variables are represented by ., hence when
we plot the model we get the following output.
Output:
Step 5: Making a prediction
# testing the people who are native speakers and those who are not
>predict_model<-predict(model, test_data)
# creates a table to count how many are classified as native speakers and how many are not
>m_at <- table(test_data$nativeSpeaker, predict_model)
>m_at
The confusion matrix compares the actual nativeSpeaker values in the test data with the model's predictions,
showing how many test observations were classified correctly as native or non-native speakers and how many
were misclassified.
Step 6: Determining the accuracy of the model developed
>ac_Test <- sum(diag(m_at)) / sum(m_at)
>print(paste('Accuracy for test is found to be', ac_Test))
Here the accuracy on the test set is calculated from the confusion matrix and is found to be 0.74.
Hence this model predicts with an accuracy of about 74%.
13) Implement decision trees using ‘iris’ dataset using package party and ‘rpart’. (Recursive
Partitioning and Regression Trees)
Fit a rpart model
Usage:
rpart(formula, data, weights, subset, na.action = na.rpart, method,
model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
Arguments
formula: a formula, with a response but no interaction terms. If this is a data frame, it is taken as the
model frame.
data : an optional data frame in which to interpret the variables named in the formula.
weights: optional case weights.
subset : optional expression saying that only a subset of the rows of the data should be used in the fit.
na.action: the default action deletes all observations for which y is missing, but keeps those in which one
or more predictors are missing.
method: one of "anova", "poisson", "class" or "exp". If method is missing then the routine tries to make
an intelligent guess. If y is a survival object, then method = "exp" is assumed, if y has 2 columns then
method = "poisson" is assumed, if y is a factor then method = "class" is assumed, otherwise method =
"anova" is assumed. It is wisest to specify the method directly, especially as more criteria may added to
the function in future.
Alternatively, method can be a list of functions named init, split and eval. Examples are given in the file
‘tests/usersplits.R’ in the sources, and in the vignettes ‘User Written Split Functions’.
model : if logical: keep a copy of the model frame in the result? If the input value for model is a model
frame (likely from an earlier call to the rpart function), then this frame is used rather than constructing
new data.
x : keep a copy of the x matrix in the result.
y : keep a copy of the dependent variable in the result. If missing and model is supplied this defaults
to FALSE.
parms : optional parameters for the splitting function.
Anova splitting has no parameters.
Poisson splitting has a single parameter, the coefficient of variation of the prior distribution on the rates.
The default value is 1.
Exponential splitting has the same parameter as Poisson.
For classification splitting, the list can contain any of: the vector of prior probabilities (component prior),
the loss matrix (component loss) or the splitting index (component split). The priors must be positive and
sum to 1. The loss matrix must have zeros on the diagonal and positive off-diagonal elements. The
splitting index can be gini or information. The default priors are proportional to the data counts, the losses
default to 1, and the split defaults to gini.
control : a list of options that control details of the rpart algorithm. See rpart.control.
cost : a vector of non-negative costs, one for each variable in the model. Defaults to one for all
variables. These are scalings to be applied when considering splits, so the improvement on splitting on a
variable is divided by its cost in deciding which split to choose.
...
arguments to rpart.control may also be specified in the call to rpart. They are checked against the list of
valid arguments.
Details: This differs from the tree function in S mainly in its handling of surrogate variables. In most
details it follows Breiman et al. (1984) quite closely. R package tree provides a re-implementation of tree.
Value: An object of class rpart.
Program:
> library(rpart)
> install.packages('rpart.plot')
> library(rpart.plot)
>data<-iris
>head(data)
> dt3 = rpart(Species ~ ., control = rpart.control(minsplit = 10, maxdepth = 5), data = iris, method = "class")
> dt3
n= 150
>rpart.plot(dt3,type=4,extra=1)
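The experiment also names the party package; a minimal sketch of the equivalent ctree() model on iris (dt_party is an illustrative name):
> library(party)
> dt_party <- ctree(Species ~ ., data = iris)   # decision tree on iris with the party package
> plot(dt_party)
> table(predict(dt_party), iris$Species)        # compare predictions with the actual species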
14. Use a Corpus() function to create a data corpus then Build a term Matrix and Reveal word
frequencies
corpus: Text Corpus Analysis
Text corpus data analysis, with full support for international text (Unicode). Functions for reading data
from newline-delimited 'JSON' files, for normalizing and tokenizing text, for searching for term
occurrences, and for computing term occurrence frequencies, including n-grams.
> install.packages("corpus")
package ‘corpus’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Prasanna Kumar\AppData\Local\Temp\Rtmps7sf41\downloaded_packages
> library(corpus)
> help(corpus)
The Corpus Package
Text corpus analysis functions.
Details:
This package contains functions for text corpus analysis. To create a text object, use the read_ndjson or
as_corpus_text function. To split text into sentences or token blocks, use text_split. To specify
preprocessing behavior for transforming a text into a token sequence, use text_filter. To tokenize text or
compute term frequencies, use text_tokens, term_stats or term_matrix. To search for or count specific
terms, use text_locate, text_count, or text_detect.
term_matrix {corpus}
Description: Tokenize a set of texts and compute a term frequency matrix.
Usage:
term_matrix(x, filter = NULL, ngrams = NULL, select = NULL, group = NULL, transpose = FALSE, ...)
term_counts(x, filter = NULL, ngrams = NULL, select = NULL, group = NULL, ...)
Arguments
x : a text vector to tokenize.
filter : if non-NULL, a text filter to use instead of the default text filter for x.
ngrams: an integer vector of n-gram lengths to include, or NULL to use the select argument to determine
the n-gram lengths.
select :a character vector of terms to count, or NULL to count all terms that appear in x.
group : if non-NULL, a factor, character string, or integer vector the same length of x specifying the
grouping behavior.
transpose: a logical value indicating whether to transpose the result, putting terms as rows instead of
columns.
... : additional properties to set on the text filter.
Details:
term_matrix tokenizes a set of texts and computes the occurrence counts for each term, returning the result
as a sparse matrix (texts-by-terms). term_counts returns the same information, but in a data frame.
If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the
ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set
are included. If both ngrams and select are NULL, then only unigrams (single type terms) are included.
If group is NULL, then the output has one set of term counts for each input text. Otherwise, we convert
group to a factor and compute one set of term counts for each level. Texts with NA values for group get
skipped.
Value:
term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for
each term and one row for each input text or (if group is non-NULL) for each grouping level. If filter$select
is non-NULL, then the column names will be equal to filter$select. Otherwise, the columns are assigned in
arbitrary order.
term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.
term_counts with group = NULL returns a data frame with one row for each entry of the term matrix, and
columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a factor with
levels equal to the selected terms. The "text" column is a factor with levels equal to
names(as_corpus_text(x)); calling as.integer on the "text" column converts from the factor values to the
integer row index in the term matrix.
term_counts with group non-NULL behaves similarly, but the result instead has columns named "group",
"term", and "count", with "group" giving the grouping level, as a factor.
Examples
text <- c("A rose is a rose is a rose.",
"A Rose is red, a violet is blue!",
"A rose by any other name would smell as sweet.")
term_matrix(text)
# data frame
head(term_counts(text), n = 10) # first 10 rows
# with grouping
term_counts(text, group = c("Good", "Bad", "Good"))
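The experiment title refers to the Corpus() function, which comes from the tm package rather than corpus; a sketch of the same task using tm (assuming tm is installed; docs, tdm, m and word_freq are illustrative names):
# the Corpus() function comes from the 'tm' package
library(tm)
docs <- Corpus(VectorSource(text))               # create a data corpus from the text vector above
tdm <- TermDocumentMatrix(docs)                  # build a term-document matrix
m <- as.matrix(tdm)
word_freq <- sort(rowSums(m), decreasing = TRUE) # reveal word frequencies
head(word_freq)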
VIVA QUESTIONS
1. What is R Programming?
2. What are the different data objects in R?
3. What makes a valid variable name in R?
4. What is the main difference between an Array and a matrix?
5. Which data object in R is used to store and process categorical data?
6. How can you load and use csv file in R?
7. How do you get the name of the current working directory in R?
8. What is R Base package?
9. How R is used in logistic regression?
10. How do you access the element in the 2nd column and 4th row of a matrix named M?
11. What is recycling of elements in a vector? Give an example.
12. What are different ways to call a function in R?
13. What is lazy function evaluation in R?
14. How do you install a package in R?
15. Name a R packages which is used to read XML files.
16. Can we update and delete any of the elements in a list?
17. Give the general expression to create a matrix in R.
18. Which function is used to create a boxplot graph in R?
19. In doing time series analysis, what does frequency = 6 mean in the ts() function?
20. What is reshaping of data in R?
21. What is the output of runif(4)?
22. How to get a list of all the packages installed in R?
23. What is expected from running the command - strsplit(x,"e")?
24. Give a R script to extract all the unique words in uppercase from the string - "The quick brown
fox jumps over the lazy dog".
25. Vector v is c(1,2,3,4) and list x is list(5:8), what is the output of v*x[1]?
26. Vector v is c(1,2,3,4) and list x is list(5:8), what is the output of v*x[[1]]?
27. What does unlist() do?
28. Give the R expression to get 26 or less heads from a 51 tosses of a coin using pbinom.
29. X is the vector c(5,9.2,3,8.51,NA), What is the output of mean(x)?
30. How do you convert the data in a JSON file to a data frame?
31. Give a function in R that replaces all missing values of a vector x with the sum of elements of that
vector?
32. What is the use of apply() in R?
33. Is an array a matrix or a matrix an array?
34. How to find the help page on missing values?
35. How do you get the standard deviation for a vector x?
36. How do you set the path for current working directory in R?
37. What is the difference between "%%" and "%/%"?
38. What does col.max(x) do?
39. Give the command to create a histogram.
40. How do you remove a vector from the R workspace?
41. List the data sets available in package "MASS"
42. List the data sets available in all available packages.
43. What is the use of the command - install.packages(file.choose(), repos=NULL)?
44. Give the command to check if the element 15 is present in vector x.
45. Give the syntax for creating scatterplot matrices.
46. What is the difference between subset() function and sample() function in R?
47. How do you check if "m" is a matrix data object in R?
48. What is the output for the below expression all(NA==NA)?
49. How to obtain the transpose of a matrix in R?
50. What is the use of "next" statement in R?
What is data characterization?
Data characterization is a summarization of the general features of a target class of data. Example:
analyzing a software product whose sales increased by 10%.
What is data discrimination?
Data discrimination is the comparison of the general features of the target class objects against one or
more contrasting objects.
What can business analysts gain from having a data warehouse?
First, having a data warehouse may provide a competitive advantage by presenting relevant
information from which to measure performance and make critical adjustments in order to help win
over competitors.
Second, a data warehouse can enhance business productivity because it is able to quickly
and efficiently gather information that accurately describes the organization.
Third, a data warehouse facilitates customer relationship management because it provides a
consistent view of customers and item across all lines of business, all departments and all
markets.
Finally, a data warehouse may bring about cost reduction by tracking trends, patterns, and
exceptions over long periods in a consistent and reliable manner.
Why is association rule necessary?
In data mining, association rule learning is a popular and well researched method for discovering
interesting relations between variables in large databases.
It is intended to identify strong rules discovered in databases using different measures of interestingness.
What are two types of data mining tasks?
Descriptive task
Predictive task
Define classification.
Classification is the process of finding a model (or function) that describes and distinguishes data
classes or concepts.
What are outliers?
A database may contain data objects that do not comply with the general behavior or model of the
data. These data objects are called outliers.
What do you mean by evolution analysis?
Data evolution analysis describes and models regularities or trends for objects whose behavior change
over time.
Although this may include characterization, discrimination, association and
correlation analysis, classification, prediction, or clustering of time related data.
Distinct features of such as analysis include time-series data analysis, sequence or periodicity
pattern matching, and similarity-based data analysis.
Define KDD.
The process of finding useful information and patterns in data.
What are the components of data mining?
- Database, Data Warehouse, World Wide Web, or other information repository
- Database or Data Warehouse Server
- Knowledge Base
- Data Mining Engine
- Pattern Evaluation Module
- User Interface
Define metadata.
A database that describes various aspects of data in the warehouse is called metadata.
What are the usage of metadata?
- Map source system data to data warehouse tables
- Generate data extract, transform, and load procedures for import jobs
- Help users discover what data are in the data warehouse
- Help users structure queries to access the data they need
List the demerits of distributed data warehouse.
- There is no metadata, no summary data and no individual DSS (Decision Support System) integration or
history. All queries must be repeated, causing an additional burden on the system.
- Since queries compete with production data transactions, performance can be degraded.
- There is no refreshing process, causing the queries to be very complex.
Define HOLAP.
The hybrid OLAP approach combines ROLAP and MOLAP technology.
What are data mining techniques?
Association rules
Classification and prediction
Clustering
Deviation detection
Similarity search
Sequence Mining
List different data mining tools.
Traditional data mining tools
Dashboards
Text mining tools
Define sub sequence.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs
frequently in a shopping history database, is a (frequent) sequential pattern.
What is data warehouse?
A data warehouse is an electronic storage of an organization's historical data for the purpose of
reporting, analysis and data mining or knowledge discovery.