Data Mining Lab 2

Uploaded by

Usama Naveed

Lab Name: To perform basic manipulation on various data structures of the modern programming language R.
Course title: Soft Computing and Data mining Lab Total Marks: ___20_________
Practical No. 2 Date of experiment performed: ____________
Course teacher/Lab Instructor: Engr. Muhammad Usman Date of marking: ____________
Student Name:__________________________
Registration no.__________________________

Marking Evaluation Sheet

Knowledge components | Domain | Taxonomy level | Contribution | Max. marks | Obtained marks
1. Student is aware of the requirements and use of the apparatus involved in the experiment. | Psychomotor | Imitation (P1) | 70% | 3 |
2. Student has conducted the experiment by practicing the hands-on skills as per instructions. | Psychomotor | Manipulate (P2) | 70% | 11 |
3. Student has achieved the required accuracy in performance. | Psychomotor | Precision (P3) | 70% | - |
4. Student is aware of discipline & safety rules and follows them during the experiment. | Affective | Receiving (A1) | 20% | 2 |
5. Student has responded well and contributed effectively in the respective lab activity. | Affective | Respond (A2) | 20% | 2 |
6. Student understands the use of modern programming languages and software environments for Data Mining (DM). | Cognitive | Understand (C2) | 10% | 2 |
Total | | | | 20 |
Normalized marks out of 5 | | | | (5) |

Signed by Course teacher/ Lab Instructor


EXPERIMENT # 1
To perform basic manipulation on various data structures of modern programming
language named R.

PRE LAB TASK

Objective:
1. To become familiar with the data structures of the R programming language.
2. To become familiar with the basic manipulation functions of R.
3. To learn how to apply the basic manipulation functions to the different data structures
of R.

Theory:

1. Data Structures of modern programming language named R.


In R, an object is anything that can be assigned to a variable. This includes constants, data
structures, functions, and even graphs. Hence data structures are a special type of object. Like other
objects, data structures have a mode (which describes how the object is stored) and a class (which
tells generic functions like print how to handle it). R has a wide variety of data structures for holding
data, including scalars, vectors, matrices, arrays, data frames, and lists. These data structures are
discussed below.
1.1 Vectors:
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. Put
simply, a vector is a sequence of values of the same kind. The combine function c() is usually used
to form a vector. A vector can also be extracted from a data structure such as a data frame with the
accessor operator $. In either case, the length function, with syntax 'length(name of data
structure)', reports how many entries the vector has.
Here are examples of each type of vector:
a <- c(1, 2, 5, 3, 6, -2, 4)
b <- c("one", "two", "three")
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
Here, a is a numeric vector, b is a character vector, and c is a logical vector. Note that the data in a
vector must be of only one type or mode (numeric, character, or logical). You can't mix modes in the
same vector; if you try, R converts every element to a single common mode.
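As a small illustration of the single-mode rule, the sketch below shows what R does when values of different modes are passed to c(). The variable names mixed and num_lgl are chosen only for this example.

```r
# Mixing modes in c() does not raise an error: R coerces all values
# to one common mode (logical -> numeric -> character).
mixed <- c(1, "two", TRUE)
class(mixed)    # "character"
mixed           # "1" "two" "TRUE"

num_lgl <- c(1, TRUE, FALSE)
class(num_lgl)  # "numeric"
num_lgl         # 1 1 0
```

Because the coercion is silent, it is worth checking class() after building a vector from mixed input.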
NOTE Scalars are one-element vectors. Examples include f <- 3, g <- "US",
and h <- TRUE. They’re used to hold constants.
You can refer to elements of a vector using a numeric vector of positions within brackets.
For example, a[c(2, 4)] refers to the second and fourth elements of vector a.
Here are additional examples:
> a <- c("k", "j", "h", "a", "c", "m")
> a[3]
[1] "h"
> a[c(1, 3, 5)]
[1] "k" "h" "c"
> a[2:6]
[1] "j" "h" "a" "c" "m"
The colon operator used in the last statement generates a sequence of numbers. For example,
a <- c(2:6) is equivalent to a <- c(2, 3, 4, 5, 6).
1.2 Matrices:
A matrix is a two-dimensional array in which each element has the same mode (numeric, character,
or logical). Matrices are created with the matrix() function. The general format is
mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
                   byrow=logical_value,
                   dimnames=list(char_vector_rownames, char_vector_colnames))
where vector contains the elements for the matrix, nrow and ncol specify the row and column
dimensions, and dimnames contains optional row and column labels stored in character vectors.
The option byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by
column (byrow=FALSE). The default is by column. The following listing demonstrates the matrix
function.
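The listing itself is not reproduced in this copy of the handout. A minimal sketch that matches the surrounding description (a 5 × 4 matrix, then a labelled 2 × 2 matrix filled by rows and by columns) might look like this; the cell values and the names y, by_row and by_col are illustrative assumptions.

```r
# 5 x 4 matrix holding the numbers 1 to 20, filled by column (the default)
y <- matrix(1:20, nrow = 5, ncol = 4)
y

# Illustrative values and labels for a 2 x 2 matrix
cells  <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")

# Filled by rows: the first row is 1 26, the second row is 24 68
by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
                 dimnames = list(rnames, cnames))
by_row

# Filled by columns: the first column is 1 26, the second column is 24 68
by_col <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE,
                 dimnames = list(rnames, cnames))
by_col
```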

First you create a 5 × 4 matrix. Then you create a 2 × 2 matrix with labels and fill the matrix
by rows. Finally, you create a 2 × 2 matrix and fill the matrix by columns.

You can identify rows, columns, or elements of a matrix by using subscripts and brackets. X[i,]
refers to the ith row of matrix X, X[,j] refers to the jth column, and X[i, j] refers to the element
in row i and column j. The subscripts i and j can be numeric vectors in order to select multiple
rows or columns, as shown in the following listing.
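The subscript listing is also missing from this copy. The sketch below reconstructs it from the description in the text (a 2 × 5 matrix of the numbers 1 to 10, filled by column); the variable name x is an assumption.

```r
# 2 x 5 matrix holding 1 to 10, filled by column
x <- matrix(1:10, nrow = 2)
x

x[2, ]         # second row:                 2 4 6 8 10
x[, 2]         # second column:              3 4
x[1, 4]        # row 1, column 4:            7
x[1, c(4, 5)]  # row 1, columns 4 and 5:     7 9
```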
In that listing, a 2 × 5 matrix is first created containing the numbers 1 to 10. By default, the
matrix is filled by column. Then the elements in the second row are selected, followed by the
elements in the second column. Next, the element in the first row and fourth column is selected.
Finally, the elements in the first row and the fourth and fifth columns are selected.

Matrices are two-dimensional and, like vectors, can contain only one data type. When there are
more than two dimensions, you use arrays (section 1.3). When there are multiple modes of
data, you use data frames (section 1.4).
1.3 Arrays:
Arrays are similar to matrices but can have more than two dimensions. They're created
with the array() function, which has the following form:

myarray <- array(vector, dimensions, dimnames)

where vector contains the data for the array, dimensions is a numeric vector giving the maximal
index for each dimension, and dimnames is an optional list of dimension labels. The following
listing gives an example of creating a three-dimensional (2 × 3 × 4) array of numbers.
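The array listing is not included in this copy. A minimal sketch of a 2 × 3 × 4 array, with dimension labels that are assumptions made for illustration:

```r
# Labels for each of the three dimensions (illustrative names)
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")

# 2 x 3 x 4 array holding the numbers 1 to 24
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))
z

dim(z)                 # 2 3 4
z["A2", "B3", "C4"]    # last element: 24
```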

1.4 Data Frames:


A data frame is more general than a matrix in that different columns can contain different
modes of data (numeric, character, and so on). It’s similar to the dataset you’d typically see in
SAS, SPSS, and Stata. Data frames are the most common data structure you’ll deal with in R.
Suppose an example dataset consists of both numeric and character data. Because there are
multiple modes of data, you can't store the data in a matrix. In this case, a data frame is the
structure of choice.
A data frame is created with the data.frame() function:

mydata <- data.frame(col1, col2, col3,...)

where col1, col2, col3, and so on are column vectors of any type (such as character, numeric,
or logical). Names for each column can be provided with the names function.

The following listing makes this clear.
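The listing is missing from this copy. A minimal sketch follows; the patientID and age values are assumptions made for illustration, while the diabetes and status vectors are the ones used later in the Factors section.

```r
# Column vectors of different modes
patientID <- c(1, 2, 3, 4)                                   # numeric (assumed values)
age       <- c(25, 34, 28, 52)                               # numeric (assumed values)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")           # character
status    <- c("Poor", "Improved", "Excellent", "Poor")      # character

# Combine the columns into one data frame
patientdata <- data.frame(patientID, age, diabetes, status)
patientdata

# Column names can be read (or replaced) with names()
names(patientdata)
```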

Each column must have only one mode, but you can put columns of different modes together
to form the data frame. Because data frames are close to what analysts typically think of as
datasets, we’ll use the terms columns and variables interchangeably when discussing data
frames.
There are several ways to identify the elements of a data frame. You can use the subscript
notation you used before (for example, with matrices), or you can specify column names. Using
the patientdata data frame created earlier, the following listing demonstrates these approaches.
It can get tiresome typing patientdata$ at the beginning of every variable name, so shortcuts
are available. You can use either the attach() and detach() or with() functions to simplify your
code.
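The approaches described above can be sketched as follows. The patientdata values are illustrative assumptions, rebuilt here so that the example runs on its own.

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

patientdata[1:2]                       # columns 1 and 2, by position
patientdata[c("diabetes", "status")]   # columns by name
patientdata$age                        # one column as a vector

# with() evaluates an expression inside the data frame,
# so the patientdata$ prefix is not needed
mean_age <- with(patientdata, mean(age))
mean_age                               # 34.75
```

with() is often preferred over attach() and detach() because it does not change the search path, so it cannot leave stale variables behind.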
1.5 Factors:
As you’ve seen, variables can be described as nominal, ordinal, or continuous. Nominal
variables are categorical, without an implied order. Diabetes (Type1, Type2) is an example of
a nominal variable. Even if Type1 is coded as a 1 and Type2 is coded as a 2 in the data, no
order is implied. Ordinal variables imply order but not amount. Status (poor, improved,
excellent) is a good example of an ordinal variable. You know that a patient with a poor status
isn’t doing as well as a patient with an improved status, but not by how much. Continuous
variables can take on any value within some range, and both order and amount are implied.
Age in years is a continuous variable and can take on values such as 14.5 or 22.8 and any value
in between. You know that someone who is 15 is one year older than someone who is 14.
Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors.
Factors are crucial in R because they determine how data is analyzed and presented visually.
The function factor() stores the categorical values as a vector of integers in the range [1… k],
(where k is the number of unique values in the nominal variable) and an internal vector of
character strings (the original values) mapped to these integers.
For example, assume that you have this vector:

diabetes <- c("Type1", "Type2", "Type1", "Type1")

The statement diabetes <- factor(diabetes) stores this vector as (1, 2, 1, 1) and associates it with
1 = Type1 and 2 = Type2 internally (the assignment is alphabetical). Any analyses performed
on the vector diabetes will treat the variable as nominal and select the statistical methods
appropriate for this level of measurement.

For vectors representing ordinal variables, you add the parameter ordered=TRUE to
the factor() function. Given the vector

status <- c("Poor", "Improved", "Excellent", "Poor")

the statement status <- factor(status, ordered=TRUE) will encode the vector as (3, 2, 1, 3) and
associate these values internally as 1 = Excellent, 2 = Improved, and 3 = Poor. Additionally,
any analyses performed on this vector will treat the variable as ordinal and select the statistical
methods appropriately.

By default, factor levels for character vectors are created in alphabetical order. This worked for
the status factor, because the order “Excellent,” “Improved,” “Poor” made sense. There would
have been a problem if “Poor” had been coded as “Ailing” instead, because the order would
have been “Ailing,” “Excellent,” “Improved.” A similar problem would exist if the desired
order was “Poor,” “Improved,” “Excellent.” For ordered factors, the alphabetical default is
rarely sufficient. You can override the default by specifying a levels option. For example,
status <- factor(status, ordered=TRUE,
                 levels=c("Poor", "Improved", "Excellent"))
assigns the levels as 1 = Poor, 2 = Improved, 3 = Excellent. Be sure the specified levels match
your actual data values. Any data values not in the list will be set to missing. Numeric variables
can be coded as factors using the levels and labels options. If sex was coded as 1 for male and
2 for female in the original data, then

sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female"))


would convert the variable to an unordered factor. Note that the order of the labels must match
the order of the levels. In this example, sex would be treated as categorical, the labels “Male”
and “Female” would appear in the output instead of 1 and 2, and any sex value that wasn’t
initially coded as a 1 or 2 would be set to missing.
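The internal integer codes described in this section can be inspected with as.integer() and levels(). The sketch below uses the diabetes and status vectors from the text; the sex vector, including the stray value 3, is an assumption added to show how unlisted values become missing.

```r
# Nominal factor: levels are assigned alphabetically
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))
levels(diabetes)       # "Type1" "Type2"
as.integer(diabetes)   # 1 2 1 1

# Ordered factor with an explicit level order
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))
as.integer(status)     # 1 2 3 1
status[1] < status[3]  # TRUE: Poor ranks below Excellent

# Numeric codes turned into labelled categories; 3 is not a listed level
sex <- factor(c(1, 2, 2, 3), levels = c(1, 2), labels = c("Male", "Female"))
sex                    # Male Female Female <NA>
```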

The following listing demonstrates how specifying factors and ordered factors impacts data
analyses.
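This listing is missing from the copy. A minimal sketch, assuming illustrative patientID and age values, that enters the data as vectors, declares the factors, builds the data frame, and then inspects it:

```r
# Enter the data as vectors (patientID and age values are assumed)
patientID <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")
status    <- c("Poor", "Improved", "Excellent", "Poor")

# Declare diabetes as a factor and status as an ordered factor
diabetes <- factor(diabetes)
status   <- factor(status, ordered = TRUE)

# Combine into a data frame
patientdata <- data.frame(patientID, age, diabetes, status)

str(patientdata)      # structure: shows Factor and Ord.factor columns
summary(patientdata)  # quartiles for numeric columns, counts for factors
```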

In that listing, first you enter the data as vectors. Then you specify that diabetes is a
factor and status is an ordered factor. Finally, you combine the data into a data frame. The
function str(object) provides information about an object in R (the data frame, in this case). It
clearly shows that diabetes is a factor and status is an ordered factor, along with how they're
coded internally. Note that the summary() function treats the variables differently. It provides
the minimum, maximum, mean, and quartiles for the continuous variable age, and frequency
counts for the categorical variables diabetes and status.
1.6 Lists:
Lists are the most complex of the R data types. Basically, a list is an ordered collection of
objects (components). A list allows you to gather a variety of (possibly unrelated) objects under
one name. For example, a list may contain a combination of vectors, matrices, data frames, and
even other lists. You create a list using the list() function
mylist <- list(object1, object2, ...)
where the objects are any of the structures seen so far. Optionally, you can name the
objects in a list:
mylist <- list(name1=object1, name2=object2, ...)
The following listing shows an example.
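The listing is not reproduced in this copy. A minimal sketch consistent with the description that follows (a string, a four-element numeric vector named ages, a matrix, and a character vector); the specific values are assumptions.

```r
g <- "My First List"            # a string
h <- c(25, 26, 18, 39)          # a numeric vector
j <- matrix(1:10, nrow = 5)     # a matrix
k <- c("one", "two", "three")   # a character vector

# The first two components are named, the last two are not
mylist <- list(title = g, ages = h, j, k)
mylist

mylist[[2]]        # second component, by position
mylist[["ages"]]   # same component, by name
mylist$ages        # same component, with the $ shortcut
```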
In this example, you create a list with four components: a string, a numeric vector, a matrix,
and a character vector. You can combine any number of objects and save them as a list. You
can also specify elements of the list by indicating a component number or a name within double
brackets.
In this example, mylist[[2]] and mylist[["ages"]] both refer to the same four-element numeric
vector. For named components, mylist$ages would also work. Lists are important R structures
for two reasons. First, they allow you to organize and recall disparate information in a simple
way. Second, many R functions return their results as lists. It's up to the analyst to pull out the
components that are needed.

2. Basic Data Manipulation Functions:

A function, in a programming environment, is a set of instructions. A programmer builds a
function to avoid repeating the same task, or to reduce complexity.
A function
- should be written to carry out a specified task
- may or may not include arguments
A general approach to a function is to take the arguments as inputs, feed them to the body, and
finally return an output. The syntax of a function is the following:
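The syntax block is missing from this copy. The general shape of an R function definition is sketched below; scale_values and its arguments are hypothetical names written only for this illustration.

```r
# General shape:
#   function_name <- function(arg1, arg2 = default_value) {
#     body
#     return(value)
#   }

# Hypothetical example: multiply a vector by a constant (default 2)
scale_values <- function(x, multiplier = 2) {
  result <- x * multiplier
  return(result)
}

scale_values(c(1, 2, 3))                    # default used:      2 4 6
scale_values(c(1, 2), 3)                    # matched by position: 3 6
scale_values(multiplier = 10, x = c(1, 2))  # matched by name:   10 20
```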

There are many built-in functions in R. R matches your input parameters with the function's
arguments, either by name or by position, then executes the function body. Function arguments
can have default values: if you do not specify these arguments, R will use the default value.
We will see three groups of functions in action:
1. General functions
2. Math functions
3. Statistical functions
2.1 General Functions:
We are already familiar with general functions such as cbind(), rbind(), range(), sort(), and
order(). Each of these functions has a specific task and takes arguments to return an output.
The following are important functions one must know.
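As a quick reminder of what the general functions named above do, here is a small sketch on two made-up vectors a and b (the values are assumptions):

```r
a <- c(3, 1, 2)
b <- c(30, 10, 20)

cbind(a, b)    # bind as columns: a 3 x 2 matrix
rbind(a, b)    # bind as rows:    a 2 x 3 matrix
range(a)       # smallest and largest value: 1 3
sort(a)        # sorted values:              1 2 3
order(a)       # positions that sort a:      2 3 1
b[order(a)]    # reorder b to follow a:      10 20 30
```

sort() returns the values themselves, whereas order() returns indices, which is what makes it useful for sorting one vector (or a data frame) by another.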

diff()function:

If you work on time series, you often need to make the series stationary by differencing it,
that is, by subtracting lagged values. A stationary process has a constant mean, variance and
autocorrelation over time, which mainly improves the prediction of a time series. Differencing
can easily be done with the function diff(). We can build a random time series with a trend and
then use diff() to make the series stationary. The diff() function takes a vector as its main
argument (with optional lag and differences arguments) and returns suitably lagged and
iterated differences.
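A minimal sketch of diff(), first on a short made-up vector where the result is easy to verify by hand, and then on a simulated series with a linear trend. The series, its slope of 0.5, and the seed are assumptions for illustration.

```r
x <- c(2, 5, 9, 14)
diff(x)                    # x[i+1] - x[i]:            3 4 5
diff(x, lag = 2)           # x[i+2] - x[i]:            7 9
diff(x, differences = 2)   # difference of differences: 1 1

# Simulated series: linear trend plus random noise
set.seed(1)
series <- 0.5 * (1:100) + rnorm(100)
detrended <- diff(series)  # removes the linear trend
length(detrended)          # one element shorter: 99

# Visual comparison of the two series
plot(series, type = "l", main = "Series with trend")
plot(detrended, type = "l", main = "After diff()")
```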

length() function

In many cases, we want to know the length of a vector for computation or for use in a for
loop. The length() function returns the number of elements in vector x.
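A short sketch of length() on its own and inside a for loop; the vector x and the data frame df are made-up examples.

```r
x <- c(10, 20, 30, 40)
length(x)                  # 4

# Using the length to drive a loop over the elements
total <- 0
for (i in seq_len(length(x))) {
  total <- total + x[i]
}
total                      # 100

# For a data frame, length() gives the number of columns, not rows
df <- data.frame(a = 1:3, b = 4:6)
length(df)                 # 2
nrow(df)                   # 3
```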

2.2 Math Functions:


R has an array of mathematical functions.
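The table of math functions is not included in this copy. A few commonly used base R functions, shown as a sketch rather than a complete list:

```r
abs(-3.5)          # absolute value:            3.5
sqrt(16)           # square root:               4
ceiling(3.2)       # round up:                  4
floor(3.8)         # round down:                3
round(3.14159, 2)  # round to 2 decimal places: 3.14
exp(1)             # e raised to the power 1
log(exp(2))        # natural logarithm:         2
log10(100)         # base-10 logarithm:         2
5 %% 2             # remainder:                 1
5 %/% 2            # integer division:          2
```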

2.3 Statistical Functions:


The standard R installation contains a wide range of statistical functions. This lab looks
briefly at the most important ones.
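The table of statistical functions is not included in this copy. A sketch of some frequently used base R functions on a made-up sample x:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)       # arithmetic mean:           5
median(x)     # middle value:              4.5
var(x)        # sample variance:           32/7, about 4.57
sd(x)         # sample standard deviation: about 2.14
min(x)        # smallest value:            2
max(x)        # largest value:             9
sum(x)        # total:                     40
quantile(x)   # minimum, quartiles, and maximum
```

Note that var() and sd() use the sample (n - 1) denominator.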
LAB SESSION

Lab Task:
1. To perform basic manipulation of a given data having various data structures using
modern programming language named R.

Apparatus:
- Laptop
- R

Experimental Procedure:

1. How to Setup R:

1. Start Microsoft Windows.
2. Open the website http://cran.r-project.org, or use a pen drive to access the software folder
containing R-3.6.2-win.exe.
3. Open the software folder, double-click the 'R-3.6.2-win.exe' file, and run
the setup.
4. Press Next until you reach the window which asks for the key.
5. Finally, choose Finish and close the installation.

2. Get started with R:

1. Start R by double-clicking the R icon on your desktop. It will open the following window
on your PC, as shown in the image.

Fig. 1. R Startup GUI window

2. Install a package named dslabs


> install.packages("dslabs")
3. Load the example dataset named murders, which is stored in the dslabs package
> library(dslabs)
> data(murders)
4. Check the class of data
>class(murders)
[1] "data.frame"
5. Examine the dataset in detail
>str(murders)
'data.frame': 51 obs. of 5 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ abb : chr "AL" "AK" "AZ" "AR" ...
$ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
$ population: num 4779736 710231 6392017 2915918 37253956 ...
$ total : num 135 19 232 93 1257 ...
6. Show first six lines of data
>head(murders)
state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
7. Get the name of all variables in dataset
>names(murders)
[1] "state" "abb" "region" "population" "total"
8. Get the data under the population variable of the given dataset.
> murders$population
[1] 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934
[9] 601723 19687653 9920000 1360301 1567582 12830632 6483802 3046355
[17] 2853118 4339367 4533372 1328361 5773552 6547629 9883640 5303925
[25] 2967297 5988927 989415 1826341 2700551 1316470 8791894 2059179
[33] 19378102 9535483 672591 11536504 3751351 3831074 12702379 1052567
[41] 4625364 814180 6346105 25145561 2763885 625741 8001024 6724540
[49] 1852994 5686986 563626
9. Get the class of each variable (length() can be applied to each variable in the same way)
>class(murders$population)
>class(murders$state)
>class(murders$abb)
>class(murders$region)
>class(murders$total)
10. List the levels of the variable whose class is factor
>levels(murders$region)
11. Reorder the levels of the region variable by the total number of murders in each region,
instead of the default order
region <- murders$region
value <- murders$total
region <- reorder(region, value, FUN = sum)
levels(region)
#> [1] "Northeast" "North Central" "West" "South"
12. Access a single element and a range of rows of the dataset under study.
data("murders")
murders[25, 1]
#> [1] "Mississippi"
murders[2:3, ]
#> state abb region population total
#> 2 Alaska AK West 710231 19
#> 3 Arizona AZ West 6392017 232
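The reordering in step 11 can be tried without the dslabs package. The sketch below applies reorder() to a small made-up factor; the region names and totals are assumptions, not values from the murders dataset.

```r
# Made-up data: one region label and one count per observation
region <- factor(c("East", "West", "East", "North", "West", "North"))
total  <- c(5, 50, 10, 1, 40, 2)

levels(region)            # default alphabetical order: "East" "North" "West"

# Reorder levels by the sum of total within each region (ascending)
# Sums: North = 3, East = 15, West = 90
region_by_total <- reorder(region, total, FUN = sum)
levels(region_by_total)   # "North" "East" "West"
```

Only the order of the levels changes; the values stored in the factor stay the same.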
Extra Credit Points:
(Following a similar procedure, and using the data from the PRE LAB TASK section, complete the
tasks provided to you as an exercise.)

EXPERIMENT DOMAIN:

Domain | Attribute | Taxonomy level | Marks distribution
Psychomotor (70%) | Realization of Experiment (Awareness) | P1 | 3
Psychomotor (70%) | Conducting Experiment (Act) | P2 | 5
Psychomotor (70%) | Data Collection (Use Instrument) | P2 | 3
Psychomotor (70%) | Data Analysis (Perform) | P2 | 3
Affective (20%) | Discipline (Receiving) | A1 | 3
Affective (20%) | Individual Participation (Respond/Contribute) | A2 | 1
Cognitive (10%) | Understand | C2 | 2
LAB REPORT
Prepare the Lab Report as below:
TITLE:

OBJECTIVE:

APPARATUS:

PROCEDURE:
(Note: Use all the steps you studied in the LAB SESSION of this lab to write the procedure and to
complete the experiment)
DISCUSSION:

Q1.: How can you view the packages you have installed in R?

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

Q2.: List the names of any two types of data structures used in R.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
_______________

Conclusion /Summary
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

Domain | Attribute | Taxonomy level | Marks distribution | Obtained marks
Psychomotor (70%) | Realization of Experiment (Awareness) | P1 | 3 |
Psychomotor (70%) | Conducting Experiment (Act) | P2 | 5 |
Psychomotor (70%) | Data Collection (Use Instrument) | P2 | 3 |
Psychomotor (70%) | Data Analysis (Perform) | P2 | 3 |
Affective (20%) | Discipline (Receiving) | A1 | 3 |
Affective (20%) | Individual Participation (Respond/Contribute) | A2 | 1 |
Cognitive (10%) | Understand | C2 | 2 |