Data Mining Lab 2
Course title: Soft Computing and Data mining Lab Total Marks: ___20_________
Practical No. 2 Date of experiment performed: ____________
Course teacher/Lab Instructor: Engr. Muhammad Usman Date of marking: ____________
Student Name:__________________________
Registration no.__________________________
Normalized marks out of 5 (5)
Objective:
1. To become familiar with the data structures of the R programming language.
2. To become familiar with the basic manipulation functions of R.
3. To learn how to apply the basic manipulation functions to the different data
structures of R.
Theory:
First, you create a 5 × 4 matrix. Then you create a 2 × 2 matrix with labels and fill the matrix
by rows. Finally, you create a 2 × 2 matrix and fill the matrix by columns. You can identify
rows, columns, or elements of a matrix by using subscripts and brackets. X[i,] refers to the ith
row of matrix X, X[,j] refers to the jth column, and X[i, j] refers to the element in row i and
column j. The subscripts i and j can be numeric vectors in order to select multiple rows or
columns, as shown in the following listing.
In listing 2.2, a 2 × 5 matrix is first created containing the numbers 1 to 10. By default, the
matrix is filled by column. Then the elements in the second row are selected, followed by the
elements in the second column. Next, the element in the first row and fourth column is selected.
Finally, the elements in the first row and the fourth and fifth columns are selected.
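The subscript steps just described can be sketched as follows. This is a minimal illustration, not the original listing; the variable names (x, row2, col2, elem14, elems) are chosen here for the example.

```r
# Create a 2 x 5 matrix holding 1..10; matrix() fills by column by default
x <- matrix(1:10, nrow = 2)
print(x)

row2   <- x[2, ]         # all elements in the second row
col2   <- x[, 2]         # all elements in the second column
elem14 <- x[1, 4]        # element in row 1, column 4
elems  <- x[1, c(4, 5)]  # row 1, columns 4 and 5

print(row2)    # 2 4 6 8 10
print(col2)    # 3 4
print(elem14)  # 7
print(elems)   # 7 9
```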
Matrices are two-dimensional and, like vectors, can contain only one data type. When there are
more than two dimensions, you use arrays (section 1.3). When there are multiple modes of
data, you use data frames (section 1.4).
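The matrix-creation steps from the start of this section, together with the one-data-type rule, might look like the sketch below. The cell values and the labels R1, R2, C1, C2 are assumptions made up for illustration.

```r
# 5 x 4 matrix of the numbers 1..20, filled by column (the default)
y <- matrix(1:20, nrow = 5, ncol = 4)

# Illustrative data and labels (assumed values, not from the handout)
cells  <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")

# 2 x 2 matrix with labels, filled by rows
byrow_mat <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
                    dimnames = list(rnames, cnames))

# Same data and labels, filled by columns
bycol_mat <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE,
                    dimnames = list(rnames, cnames))

print(byrow_mat)
print(bycol_mat)

# A matrix holds a single data type: mixing numbers and text
# coerces every element to character
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
print(is.character(mixed))  # TRUE
```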
1.3 Arrays:
Arrays are similar to matrices but can have more than two dimensions. They’re created
with an array function of the following form:
myarray <- array(vector, dimensions, dimnames)
where vector contains the data for the array, dimensions is a numeric vector giving the maximal
index for each dimension, and dimnames is an optional list of dimension labels. The following
listing gives an example of creating a three-dimensional (2 × 3 × 4) array of numbers.
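A minimal sketch of such a 2 × 3 × 4 array is given below. The dimension labels (A1, B1, C1, and so on) are assumed names used only for this example.

```r
# Labels for each of the three dimensions (illustrative names)
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")

# 24 values arranged as 2 rows x 3 columns x 4 layers
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))
print(z)

# Elements are selected with one subscript per dimension
print(z[1, 2, 3])           # 15
print(z["A1", "B2", "C3"])  # same element, selected by label
```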
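1.4 Data frames:
A data frame is created with the data.frame() function:
mydata <- data.frame(col1, col2, col3, ...)
where col1, col2, col3, and so on are column vectors of any type (such as character, numeric,
or logical). Names for each column can be provided with the names function.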
Each column must have only one mode, but you can put columns of different modes together
to form the data frame. Because data frames are close to what analysts typically think of as
datasets, we’ll use the terms columns and variables interchangeably when discussing data
frames.
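As a hedged illustration, a small patientdata data frame could be built as shown below. The patient values are invented for the example, and stringsAsFactors = FALSE is set explicitly so the result is the same across R versions.

```r
# Illustrative column vectors (assumed values)
patientID <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")
status    <- c("Poor", "Improved", "Excellent", "Poor")

# Columns of different modes (numeric and character) in one data frame
patientdata <- data.frame(patientID, age, diabetes, status,
                          stringsAsFactors = FALSE)
print(patientdata)

# names() returns the column (variable) names
print(names(patientdata))
```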
There are several ways to identify the elements of a data frame. You can use the subscript
notation you used before (for example, with matrices), or you can specify column names. Using
the patientdata data frame created earlier, the following listing demonstrates these approaches.
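The approaches can be sketched as follows. The data frame is rebuilt here with invented values so that the example runs on its own.

```r
# Illustrative data frame (assumed values)
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor"),
  stringsAsFactors = FALSE
)

first_two <- patientdata[1:2]                       # columns by position
by_name   <- patientdata[c("diabetes", "status")]   # columns by name
ages      <- patientdata$age                        # one column via $

print(first_two)
print(by_name)
print(ages)

# Cross-tabulate two variables selected with the $ notation
tab <- table(patientdata$diabetes, patientdata$status)
print(tab)
```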
It can get tiresome typing patientdata$ at the beginning of every variable name, so shortcuts
are available. You can use either the attach() and detach() functions or the with() function to
simplify your code.
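Both shortcuts are sketched below on an invented data frame. Note that attach() can be confusing when an object with the same name as a column already exists in the workspace, so with() is often the safer choice.

```r
# Illustrative data frame (assumed values)
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  stringsAsFactors = FALSE
)

# with(): evaluate an expression using the data frame's columns
mean_age <- with(patientdata, mean(age))
print(mean_age)  # 34.75

# attach()/detach(): add the data frame to the search path, then remove it
attach(patientdata)
max_age <- max(age)
detach(patientdata)
print(max_age)   # 52
```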
1.5 Factors:
As you’ve seen, variables can be described as nominal, ordinal, or continuous. Nominal
variables are categorical, without an implied order. Diabetes (Type1, Type2) is an example of
a nominal variable. Even if Type1 is coded as a 1 and Type2 is coded as a 2 in the data, no
order is implied. Ordinal variables imply order but not amount. Status (poor, improved,
excellent) is a good example of an ordinal variable. You know that a patient with a poor status
isn’t doing as well as a patient with an improved status, but not by how much. Continuous
variables can take on any value within some range, and both order and amount are implied.
Age in years is a continuous variable and can take on values such as 14.5 or 22.8 and any value
in between. You know that someone who is 15 is one year older than someone who is 14.
Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors.
Factors are crucial in R because they determine how data is analyzed and presented visually.
The function factor() stores the categorical values as a vector of integers in the range [1…k]
(where k is the number of unique values in the nominal variable) and an internal vector of
character strings (the original values) mapped to these integers.
For example, assume that you have this vector:
diabetes <- c("Type1", "Type2", "Type1", "Type1")
The statement diabetes <- factor(diabetes) stores this vector as (1, 2, 1, 1) and associates it with
1 = Type1 and 2 = Type2 internally (the assignment is alphabetical). Any analyses performed
on the vector diabetes will treat the variable as nominal and select the statistical methods
appropriate for this level of measurement.
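The internal coding can be inspected directly, as in this short sketch. The vector is the one implied by the (1, 2, 1, 1) coding described in the text.

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")
diabetes <- factor(diabetes)

print(as.integer(diabetes))  # 1 2 1 1  (the internal integer codes)
print(levels(diabetes))      # "Type1" "Type2"  (alphabetical by default)
print(is.ordered(diabetes))  # FALSE: a nominal factor has no order
```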
For vectors representing ordinal variables, you add the parameter ordered=TRUE to
the factor() function. Given the vector
status <- c("Poor", "Improved", "Excellent", "Poor")
the statement status <- factor(status, ordered=TRUE) will encode the vector as (3, 2, 1, 3) and
associate these values internally as 1 = Excellent, 2 = Improved, and 3 = Poor. Additionally,
any analyses performed on this vector will treat the variable as ordinal and select the statistical
methods appropriately.
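A brief sketch of this default behaviour follows. The vector is the one implied by the (3, 2, 1, 3) coding described in the text.

```r
status <- c("Poor", "Improved", "Excellent", "Poor")
status <- factor(status, ordered = TRUE)

print(as.integer(status))  # 3 2 1 3
print(levels(status))      # "Excellent" "Improved" "Poor"

# With alphabetical levels, "Poor" ranks highest, so this comparison is TRUE
print(status[1] > status[2])
```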
By default, factor levels for character vectors are created in alphabetical order. This worked for
the status factor, because the order “Excellent,” “Improved,” “Poor” made sense. There would
have been a problem if “Poor” had been coded as “Ailing” instead, because the order would
have been “Ailing,” “Excellent,” “Improved.” A similar problem would exist if the desired
order was “Poor,” “Improved,” “Excellent.” For ordered factors, the alphabetical default is
rarely sufficient. You can override the default by specifying a levels option. For example,
status <- factor(status, ordered=TRUE,
                 levels=c("Poor", "Improved", "Excellent"))
assigns the levels as 1 = Poor, 2 = Improved, 3 = Excellent. Be sure the specified levels match
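your actual data values. Any data values not in the list will be set to missing. Numeric variables
can be coded as factors using the levels and labels options. If sex was coded as 1 for male and
2 for female in the original data, then
sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female"))
would convert the variable to an unordered factor whose labels, Male and Female, appear in
output in place of 1 and 2.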
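The effect of the levels and labels options can be sketched as below. The input vectors, including the stray values "Ailing" and 3, are assumptions added to show how unmatched values become missing.

```r
# Explicit level order: Poor < Improved < Excellent
status_raw <- c("Poor", "Improved", "Excellent", "Poor")
status <- factor(status_raw, ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))
print(as.integer(status))  # 1 2 3 1

# A value that is not among the levels is set to missing (NA)
status_na <- factor(c("Poor", "Ailing"),
                    levels = c("Poor", "Improved", "Excellent"))
print(is.na(status_na))    # FALSE TRUE

# Numeric codes turned into a labelled factor; the code 3 is not a level
sex_codes <- c(1, 2, 2, 1, 3)
sex <- factor(sex_codes, levels = c(1, 2), labels = c("Male", "Female"))
print(as.character(sex))   # "Male" "Female" "Female" "Male" NA
```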
The following listing demonstrates how specifying factors and ordered factors impacts data
analyses.
In listing 2.6, you first enter the data as vectors. Then you specify that diabetes is a
factor and status is an ordered factor. Finally, you combine the data into a data frame. The
function str(object) provides information about an object in R (the data frame, in this case). It
clearly shows that diabetes is a factor and status is an ordered factor, along with how they’re
coded internally. Note that the summary() function treats the variables differently. It provides
the minimum, maximum, mean, and quartiles for the continuous variable age, and frequency
counts for the categorical variables diabetes and status.
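The steps described here can be sketched as follows. The patient values are invented for illustration and are not taken from the original listing.

```r
# Enter the data as vectors (assumed values)
patientID <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")
status    <- c("Poor", "Improved", "Excellent", "Poor")

# Declare the nominal and ordinal variables
diabetes <- factor(diabetes)
status   <- factor(status, ordered = TRUE)

# Combine into a data frame
patientdata <- data.frame(patientID, age, diabetes, status)

str(patientdata)             # structure: column types and internal coding
print(summary(patientdata))  # quartiles for numeric columns, counts for factors
```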
1.6 Lists:
Lists are the most complex of the R data types. Basically, a list is an ordered collection of
objects (components). A list allows you to gather a variety of (possibly unrelated) objects under
one name. For example, a list may contain a combination of vectors, matrices, data frames, and
even other lists. You create a list using the list() function:
mylist <- list(object1, object2, ...)
where the objects are any of the structures seen so far. Optionally, you can name the
objects in a list:
mylist <- list(name1=object1, name2=object2, ...)
The following listing shows an example.
In this example, you create a list with four components: a string, a numeric vector, a matrix,
and a character vector. You can combine any number of objects and save them as a list. You
can also specify elements of the list by indicating a component number or a name within double
brackets.
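A minimal sketch of such a four-component list is shown below. The object names g, h, j, and k and their values are assumptions for this example; only the component name ages comes from the text.

```r
g <- "My First List"             # a string
h <- c(25, 26, 18, 39)           # a numeric vector
j <- matrix(1:10, nrow = 5)      # a matrix
k <- c("one", "two", "three")    # a character vector

# The first two components are named; the last two are not
mylist <- list(title = g, ages = h, j, k)
print(mylist)

# Three equivalent ways to reach the second component
print(mylist[[2]])
print(mylist[["ages"]])
print(mylist$ages)
```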
In this example, mylist[[2]] and mylist[["ages"]] both refer to the same four-element numeric
vector. For named components, mylist$ages would also work. Lists are important R structures
for two reasons. First, they allow you to organize and recall disparate information in a simple
way. Second, many R functions return their results as lists. It’s up to the analyst to pull out the
components that are needed.
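As one illustration of a function that returns a list, the sketch below fits a simple linear model with lm() and extracts a single component. The x and y values are made up for the example.

```r
# Illustrative data (assumed values)
x <- 1:5
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)        # the fitted model object is a list underneath

print(is.list(fit))     # TRUE
print(names(fit))       # the components available for extraction

# Pull out only the component that is needed
slope <- fit$coefficients[["x"]]
print(slope)
```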
There are a lot of built-in functions in R. R matches your input parameters with its function
arguments, either by name or by position, then executes the function body. Function arguments
can have default values: if you do not specify these arguments, R will take the default value.
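Argument matching and default values can be sketched as below. seq() and round() are built-in functions; power() is a hypothetical helper defined only for this example.

```r
# Positional matching: arguments are taken in order (from, to, by)
by_position <- seq(1, 10, 3)

# Matching by name: the order no longer matters
by_name <- seq(by = 3, to = 10, from = 1)

print(by_position)  # 1 4 7 10
print(by_name)      # 1 4 7 10

# Default values: round() uses digits = 0 unless told otherwise
print(round(3.14159))              # 3
print(round(3.14159, digits = 2))  # 3.14

# Hypothetical user-defined function with a default argument
power <- function(x, exponent = 2) {
  x^exponent
}
print(power(3))                    # 9  (exponent defaults to 2)
print(power(exponent = 3, x = 2))  # 8  (matched by name)
```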
We will see three groups of functions in action:
1. General functions
2. Math functions
3. Statistical functions
2.1 General Functions:
We are already familiar with general functions such as cbind(), rbind(), range(), sort(), and
order(). Each of these functions has a specific task and takes arguments to return an output.
The following are important functions one must know.
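As a quick refresher, the general functions named above behave as in this sketch. The vectors a and b are illustrative values.

```r
a <- c(3, 1, 2)
b <- c(30, 10, 20)

print(cbind(a, b))  # combine as columns: a 3 x 2 matrix
print(rbind(a, b))  # combine as rows: a 2 x 3 matrix

print(range(a))     # 1 3  (minimum and maximum)
print(sort(a))      # 1 2 3
print(order(a))     # 2 3 1  (positions that would sort a)

# order() is handy for sorting one vector by another
print(b[order(a)])  # 10 20 30
```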
diff() function:
If you work with time series, you often need to make the series stationary by differencing it,
that is, by subtracting lagged values from the current values. A stationary process has a
constant mean, variance and autocorrelation over time, which generally makes the series easier
to predict. This can be done with the function diff(). We can build a random time series with a
trend and then use diff() to remove the trend. The diff() function takes a vector as its main
argument, along with optional lag and differences arguments, and returns the lagged and
iterated differences.
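A short sketch of diff() follows, first on a small hand-checkable vector and then on a simulated trending series. The slope of 0.5, the series length of 50, and the seed are arbitrary assumptions for the simulation.

```r
x <- c(1, 3, 6, 10, 15)

print(diff(x))                   # 2 3 4 5  (x[i+1] - x[i])
print(diff(x, lag = 2))          # 5 7 9    (x[i+2] - x[i])
print(diff(x, differences = 2))  # 1 1 1    (differences of the differences)

# Simulated series: linear trend plus random noise
set.seed(1)
trend_series <- 0.5 * (1:50) + rnorm(50)

# First differences remove the linear trend; the result is one element shorter
differenced <- diff(trend_series)
print(length(differenced))       # 49
```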
length() function:
In many cases, we want to know the length of a vector for computation or to be used in a for
loop. The length() function counts the number of elements in vector x.
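A minimal sketch of length(), including its use to drive a for loop, is given below. The vector values are illustrative.

```r
x <- c(10, 20, 30, 40)

n <- length(x)
print(n)  # 4

# Loop over the positions of x and accumulate a total
total <- 0
for (i in seq_len(n)) {
  total <- total + x[i]
}
print(total)  # 100

# For a matrix, length() counts all elements, not rows
m <- matrix(1:6, nrow = 2)
print(length(m))  # 6
```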
Lab Task:
1. To perform basic manipulation of given data held in various data structures using
the R programming language.
Apparatus:
Laptop
R
Experimental Procedure:
1. How to Setup R:
1. Start R by double-clicking the R icon on your desktop. It will open the following window
on your PC, as shown in the image.
EXPERIMENT DOMAIN:
OBJECTIVE:
APPARATUS:
PROCEDURE:
(Note: Use all the steps you studied in the LAB SESSION of this lab to write the procedure and
to complete the experiment)
DISCUSSION:
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Q2: Name any two types of data structures used in R.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
_______________
Conclusion/Summary
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________