Data Mining Lab 1
Normalize
marks out of 5
(5)
Objective:
1. To become familiar with modern programming languages and software environments for
data mining (DM).
2. To become familiar with a modern programming language named R.
3. To know how to write a program in R and then run it.
Theory:
1 http://www.indianjournals.com/ijor.aspx?target=ijor:ijsrnsc&volume=7&issue=2&article=003
2 https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1309
these data mining tools into nine different categories or types including: data mining software
(DMS or DAS), Business Intelligence packages (BI), mathematical packages (MAT), Integration
packages (INT), extensions (EXT), data mining libraries (LIB), specialties (SPEC), research
prototypes (RES) and solutions (SOL).
2. Introduction to R:
R is a language and environment for statistical computing and graphics, similar to the S
language originally developed by John Chambers at Bell Labs in the 1970s. It was not created by
software engineers for software development. Instead, it was developed by statisticians as an
interactive environment for data analysis. For more information or the full history, please refer
to the paper “A Brief History of S” by Becker, R.A. (1994) (see footnote 3). Interactivity is an
indispensable feature in data science because, as you will soon learn, the ability to quickly
explore data is a necessity for success in this field.
The first version of R was developed by Robert Gentleman and Ross Ihaka at the University
of Auckland in the mid-1990s. They wanted better statistical software for their Macintosh
teaching laboratory and built R as an open source alternative.
2.1 Why Use R:
R has many features to recommend it:
1. Most commercial statistical software platforms cost thousands, if not tens of thousands,
of dollars. R is free! If you’re a teacher or a student, the benefits are obvious.
2. R is a comprehensive statistical platform, offering all manner of data-analytic
techniques. Just about any type of data analysis can be done in R.
3. R contains advanced statistical routines not yet available in other packages. In fact, new
methods become available for download on a weekly basis. If you’re a SAS user,
imagine getting a new SAS PROC every few days.
4. R has state-of-the-art graphics capabilities. If you want to visualize complex data, R
has the most comprehensive and powerful feature set available.
5. R is a powerful platform for interactive data analysis and exploration. For example, the
results of any analytic step can easily be saved, manipulated, and used as input for
additional analyses.
6. Getting data into a usable form from multiple sources can be a challenging proposition.
R can easily import data from a wide variety of sources, including text files, database-
management systems, statistical packages, and specialized data stores. It can write data
out to these systems as well. R can also access data directly from web pages, social
media sites, and a wide range of online data services.
7. R provides an unparalleled platform for programming new statistical methods in an
easy, straightforward manner. It’s easily extensible and provides a natural language for
quickly programming recently published methods.
8. R functionality can be integrated into applications written in other languages, including
C++, Java, Python, PHP, Pentaho, SAS, and SPSS. This allows you to continue working
in a language that you may be familiar with, while adding R’s capabilities to your
applications.
9. R runs on a wide array of platforms, including Windows, Unix, and Mac OS X. It’s
likely to run on any computer you may have. (I’ve even come across guides for
installing R on an iPhone, which is impressive but probably not a good idea.)
10. If you don’t want to learn a new language, a variety of graphic user interfaces (GUIs)
are available, offering the power of R through menus and dialogs.
3 https://www.semanticscholar.org/paper/A-Brief-History-of-S-Becker/6c624ce83b11c6d19c1a7c5040ff6dafd7b185c7
2.2 Getting Started with R:
To start R, double-click the R icon on your desktop. This opens the R Console window on
your PC, as shown in the image. Interactive data analysis usually takes place in the R console,
which executes commands as you type them. You can enter commands one at a time at the
command prompt (>) or run a set of commands from a source file.
2.2.1 Scripts:
One of the great advantages of R over point-and-click analysis software is that you can save
your work as scripts. You can edit and save these scripts using a text editor.
2.2.2 Data Types:
There is a wide variety of data types, including vectors, matrices, data frames (similar to
datasets), and lists (collections of objects).
2.2.3 Objects, Class and Statements:
Most functionality is provided through built-in and user-created functions and the creation
and manipulation of objects. An object is basically anything that can be assigned a value.
For R, that is just about everything (data, functions, graphs, analytic results, and more).
Every object has a class attribute telling R how to handle it. All objects are kept in memory
during an interactive session. Basic functions are available by default. Other functions are
contained in packages that can be attached to a current session as needed.
Statements consist of functions and assignments. R uses the symbol <- for assignments,
rather than the typical = sign.
For example, the statement x <- rnorm(5) creates a vector object named x containing five
random deviates from a standard normal distribution.
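The statement above can be tried out as a short sketch. The call to set.seed() and the extra variable y are assumptions added here so the run is reproducible and the two assignment forms can be compared; they are not part of the lab sheet.

```r
# Assumption: a fixed seed so the random draws are the same on every run
set.seed(42)

# Create a vector object named x with five standard normal deviates
x <- rnorm(5)

x            # typing an object's name prints its value
class(x)     # "numeric"
length(x)    # 5

# Functions are objects too and carry their own class
class(rnorm) # "function"

# = also works for assignment at the prompt, but <- is the usual convention
y = mean(x)
y
```

Every object created this way stays in memory until the session ends or the object is removed.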
2.2.4 Getting Help:
R provides extensive help facilities and learning to navigate them will help you
significantly in your programming efforts. The built-in help system provides details,
references, and examples of any function contained in a currently installed package. You
can obtain help using the functions listed in table 1.
Table 1 R help functions
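As a rough illustration, a few standard help calls from base R are sketched below. The choice of mean and median as search terms is arbitrary, and the calls that open a help viewer are wrapped in if (interactive()) as an assumption so that the script also runs unattended.

```r
# Calls that open a help page or browser; skipped when R is not interactive
if (interactive()) {
  help.start()            # general help in a browser
  help("mean")            # help page for the function mean
  ?mean                   # shorthand for help("mean")
  help.search("median")   # search the help system for a term
  example("mean")         # run the examples from the help page
}

# apropos() returns the names of visible objects matching a pattern
matches <- apropos("^mean")
matches
```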
2.2.5 The Workspace:
The current working directory is the directory from which R will read files and to which it
will save results by default. You can find out what the current working directory is by using
the getwd() function. You can set the current working directory by using the setwd()
function. If you need to input a file that isn’t in the current working directory, use the full
pathname in the call. Always enclose the names of files and directories from the operating
system in quotation marks. Some standard commands for managing your workspace are
listed in table 2.
Table 2. Functions for managing the R workspace
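A minimal sketch of some common workspace commands follows. It assumes R's temporary directory as a harmless place to switch to, and the object names a and b are made up for illustration.

```r
old_dir <- getwd()        # remember the current working directory
setwd(tempdir())          # assumption: switch to R's temporary directory
getwd()                   # confirm the change

a <- 1:3                  # hypothetical objects
b <- "some text"
ls()                      # list the objects in the workspace

rm(b)                     # delete an object
objs_after <- ls()        # b is no longer listed
objs_after

setwd(old_dir)            # restore the original working directory
```

Note that the directory name passed to setwd() is a character string, so a literal path must be written in quotation marks.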
2.2.6 Input and Output:
By default, launching R starts an interactive session with input from the keyboard and
output to the screen. But you can also process commands from a script file (a file containing
R statements) and direct output to a variety of destinations.
INPUT:
The source("filename") function submits a script to the current session. If the filename
doesn’t include a path, the file is assumed to be in the current working directory. For
example, source("myscript.R") runs a set of R statements contained in the file myscript.R.
By convention, script filenames end with an .R extension, but this isn’t required.
TEXT OUTPUT:
The sink("filename") function redirects output to the file filename. By default, if the file
already exists, its contents are overwritten. Include the option append=TRUE to append text
to the file rather than overwriting it. Including the option split=TRUE will send output to
both the screen and the output file. Issuing the command sink() without options will return
output to the screen alone.
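The input and text output functions can be combined as in the sketch below. The file names and the two-line script are hypothetical, and the files are written to R's temporary directory as an assumption so nothing in your own folders is overwritten.

```r
# Assumption: throwaway script and output files in the temporary directory
script_file <- file.path(tempdir(), "myscript.R")
out_file    <- file.path(tempdir(), "myoutput.txt")

# Write a tiny script to disk so there is something to source
writeLines(c("x <- c(2, 4, 6)", "print(mean(x))"), script_file)

sink(out_file)            # redirect text output to the file (overwrites it)
source(script_file)       # run the script; its printed output goes to the file
sink()                    # return output to the screen

# Append to the file and echo to the screen at the same time
sink(out_file, append = TRUE, split = TRUE)
print(median(x))
sink()

readLines(out_file)       # the two captured lines, each "[1] 4"
```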
GRAPHIC OUTPUT:
Although sink() redirects text output, it has no effect on graphic output. To redirect graphic
output, use one of the functions listed in table 3. Use dev.off() to close the graphics device
and return output to the terminal.
Table 3 Functions for saving graphic output
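As a small sketch of redirecting graphic output, the code below saves a plot with pdf(), one of the standard graphics devices in base R. The file name and the plotted values are assumptions for illustration only.

```r
# Assumption: save the graph in R's temporary directory
plot_file <- file.path(tempdir(), "mygraph.pdf")

pdf(plot_file)                           # open a PDF graphics device
plot(1:10, (1:10)^2, main = "y = x^2")   # drawn into the file, not the screen
dev.off()                                # close the device and finish the file

file.exists(plot_file)
```

Other devices such as png() and jpeg() follow the same open, plot, dev.off() pattern.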
2.3 Packages in R:
R comes with extensive capabilities right out of the box. But some of its most exciting
features are available as optional modules that you can download and install. There are
more than 5,500 user-contributed modules called packages that you can download from
http://cran.r-project.org/web/packages. They provide a tremendous range of new
capabilities, from the analysis of geospatial data to protein mass spectra processing to the
analysis of psychological tests! You’ll use many of these optional packages this
semester.
Packages are collections of R functions, data, and compiled code in a well-defined format.
The directory where packages are stored on your computer is called the library. The
function .libPaths() shows you where your library is located, and the function library() shows
you what packages you’ve saved in your library. R comes with a standard set of packages
(including base, datasets, utils, grDevices, graphics, stats, and methods). They provide a
wide range of functions and datasets that are available by default. Other packages are
available for download and installation. Once installed, they must be loaded into the session
in order to be used. The command search() tells you which packages are loaded and ready
to use.
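The functions mentioned above can be checked with a short sketch like the following. The variable names are made up, and installed.packages() is used here instead of library() without arguments only because it returns its result as a value that can be stored.

```r
lib_dirs <- .libPaths()     # where the library is located (note the leading dot)
lib_dirs

# Names of the packages saved in the library
installed <- rownames(installed.packages())
head(installed)

loaded <- search()          # packages (and environments) currently attached
loaded
```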
2.3.1 Installing a Package:
A number of R functions let you manipulate packages. To install a package for the first
time, use the install.packages() command. For example, install.packages() without options
brings up a list of CRAN mirror sites. Once you select a site, you’re presented with a list
of all available packages. Selecting one downloads and installs it.
If you know what package you want to install, you can do so directly by providing it as an
argument to the function. For example, the gclus package contains functions for creating
enhanced scatter plots. You can download and install the package with the command
install.packages("gclus").
You only need to install a package once. But like any software, packages are often updated
by their authors. Use the command update.packages() to update any packages that you’ve
installed. To see details on your packages, you can use the installed.packages() command.
It lists the packages you have, along with their version numbers, dependencies, and other
information.
2.3.2 Loading a Package:
Installing a package downloads it from a CRAN mirror site and places it in your library.
To use it in an R session, you need to load the package using the library() command. For
example, to use the package gclus, issue the command library(gclus). Of course, you must
have installed a package before you can load it. You’ll only have to load the package once
in a given session. If desired, you can customize your startup environment to automatically
load the packages you use most often.
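The install-then-load sequence might look like the sketch below. The install call is left as a comment because it needs internet access, and the splines package (which ships with R) is swapped in as an assumption to show loading on a machine where gclus is not installed.

```r
# One-time installation (needs internet access), shown as a comment:
# install.packages("gclus")

# Load gclus only if it is installed, so the script does not stop with an error
if (requireNamespace("gclus", quietly = TRUE)) {
  library(gclus)
} else {
  message("gclus is not installed; run install.packages(\"gclus\") first")
}

# Assumption: splines ships with R, so loading it always works
library(splines)
"package:splines" %in% search()   # TRUE once the package is loaded
```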
2.3.3 Learning about a Package:
When you load a package, a new set of functions and datasets becomes available. Small
illustrative datasets are provided along with sample code, allowing you to try out the new
functionalities. The help system contains a description of each function (along with
examples) and information about each dataset included. Entering
help(package="package_name") provides a brief description of the package and an index
of the functions and datasets included. Using help() with any of these function or dataset
names provides further details. The same information can be downloaded as a PDF manual
from CRAN.
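One way to explore a package from code is sketched below, using the stats and datasets packages that come with R. The variable names fns and ds are hypothetical.

```r
# Open the package overview in a help viewer; skipped when not interactive
if (interactive()) {
  help(package = "stats")
}

# Names of the objects made available by an attached package
fns <- ls("package:stats")
head(fns)
"median" %in% fns           # TRUE

# Names of the datasets included in a package
ds <- data(package = "datasets")$results[, "Item"]
head(ds)
```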
2.4 Basic Plotting:
The plot() function draws a scatter plot of two variables, such as the plot of weight against
age in figure 2, allowing you to visually inspect the trend.
The scatter plot produced by plot() is informative but somewhat utilitarian and unattractive.
To get a sense of what R can do graphically, enter demo(graphics) at the command prompt.
Other demonstrations include demo(Hershey), demo(persp), and demo(image). To see a
complete list of demonstrations, enter demo() without parameters. A sample of the graphs
produced using demo() is given in figure 3.
Fig 2. Scatter plot of weight vs age
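A scatter plot of this kind can be sketched as follows. The age and weight values, their units, and the fitted line are illustrative assumptions, not the data behind figure 2.

```r
# Hypothetical data: ages in months and weights in kilograms
age    <- c(1, 3, 5, 2, 11, 9, 3, 9, 12, 3)
weight <- c(4.4, 5.3, 7.2, 5.2, 8.5, 7.3, 6.0, 10.4, 10.2, 6.1)

# Scatter plot of weight against age
plot(age, weight,
     xlab = "Age (months)", ylab = "Weight (kg)",
     main = "Weight vs age")

# Add a least-squares line to make the trend easier to see
abline(lm(weight ~ age))
```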
A dataset is usually a rectangular array of data with rows representing observations and
columns representing variables.
A data structure is the type of data that an object can hold. R has a wide variety of objects
for holding data, including scalars, vectors, matrices, arrays, data frames, and lists. They
differ in terms of the type of data they can hold, how they’re created, their structural
complexity, and the notation used to identify and access individual elements. Figure 4
shows a diagram of these data structures to help clarify how they differ.
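The structures listed above can be created and indexed as in this short sketch. All object names and values are made up for illustration.

```r
s <- 3.14                                  # scalar (a vector of length one)
v <- c(1, 2, 5, 3)                         # vector: elements of a single type
m <- matrix(1:6, nrow = 2, ncol = 3)       # matrix: two dimensions, filled by column
a <- array(1:24, dim = c(2, 3, 4))         # array: more than two dimensions
df <- data.frame(age    = c(25, 30, 18),   # data frame: columns may differ in type
                 gender = c("m", "f", "f"),
                 weight = c(166, 115, 120))
lst <- list(title = "My list", values = v, table = df)  # list: any mix of objects

v[2]             # 2
m[2, 3]          # 6  (row 2, column 3)
a[1, 2, 3]       # 15
df$age           # the age column
lst[["title"]]   # "My list"
```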
Perhaps the simplest way to enter data is from the keyboard. There are two common
methods: entering data through R’s built-in text editor and embedding data directly into
your code. We’ll consider the editor first.
(A) The edit() function in R invokes a text editor that lets you enter data manually.
Here are the steps:
1. Create an empty data frame (or matrix) with the variable names and modes you
want to have in the final dataset.
2. Invoke the text editor on this data object, enter your data, and save the results
to the data object.
The following example creates a data frame named mydata with three variables: age
(numeric), gender (character), and weight (numeric). You then invoke the text editor,
add your data, and save the results:
mydata <- data.frame(age=numeric(0),
gender=character(0), weight=numeric(0))
mydata <- edit(mydata)
(B) Alternatively, you can embed the data directly in your program. For example, the code
mydatatxt <- "
age gender weight
25 m 166
30 f 115
18 f 120
"
mydata <- read.table(header=TRUE, text=mydatatxt)
creates the same data frame as that created with the edit() function. A character string is
created containing the raw data, and the read.table() function is used to process the string
and return a data frame. The read.table() function is described more fully in the next section.
2.7 Quitting R:
The q() function ends the session and lets you quit.
2.8 Common mistakes in R:
Lab Task:
1. To become familiar with the GUI of a modern programming language named R.
2. To know how to write a program that uses R functions to summarize data with the basic
statistical measures (mean, median, mode, etc.).
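As a starting point for the second task, the basic measures might be computed as sketched below. The marks vector is hypothetical data, and stat_mode() is a helper written for this example: base R has no function for the statistical mode, and its built-in mode() reports how an object is stored instead.

```r
# Hypothetical data: seven marks out of 5
marks <- c(3, 5, 4, 5, 2, 5, 4)

mean(marks)      # 4
median(marks)    # 4

# Hypothetical helper: the most frequent value(s) in a numeric vector
stat_mode <- function(x) {
  counts <- table(x)                                # frequency of each value
  as.numeric(names(counts)[counts == max(counts)])  # value(s) with the top count
}

stat_mode(marks) # 5

summary(marks)   # minimum, quartiles, mean, and maximum in one call
```

If several values share the highest frequency, stat_mode() returns all of them.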
Apparatus:
Laptop
R
Experimental Procedure:
1. How to Set Up R:
1. Start R by double-clicking the R icon on your desktop. This opens the windows shown
in the image.
2. Create a data object named testdata. Set up this object as a data frame.
> testdata <- data.frame()
> testdata
data frame with 0 columns and 0 rows
3. Add two variables, Sex (character) and Age (numeric), to the testdata object.
> testdata <- data.frame(Sex=character(0), Age=numeric(0))
4. Add the data given below to testdata so that it matches the data list provided below.
12. Quit R
> q()
EXPERIMENT DOMAIN:
OBJECTIVE:
APPARATUS:
PROCEDURE:
(Note: Use all steps you studied in LAB SESSION of this lab to write procedure and to
complete the experiment)
DISCUSSION:
Q1: How can you display the value of a variable in the Command Window?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Conclusion /Summary
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________