Assignment-2

Using R for data preprocessing, exploratory analysis, visualization.

Data Distribution

You usually want to know whether the observations are clustered around some middle point (the average) and whether there are observations far away on their own (outliers). This is all related to the distribution of the data. There are many distributions, but common ones are the normal, Poisson, and binomial distributions. There are also distributions relating directly to statistical tests, such as the chi-squared distribution.
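As a quick illustration (not part of the original text), each of these distributions has a random-number function in base R; the exact values printed will differ on every run:

> rnorm(5, mean = 0, sd = 1)        # normal distribution
> rpois(5, lambda = 3)              # Poisson distribution
> rbinom(5, size = 10, prob = 0.5)  # binomial distribution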

Make a Stem and Leaf Plot

The stem() command redraws the data so that you can see the numeric categories (the stems) on the left and a representation of the frequency (the leaves) on the right.

> data2
[1] 4 5 7 3 4
> table(data2)
data2
3 4 5 7
1 2 1 1
> sort(data2)
[1] 3 4 4 5 7
> stem(data2)

The decimal point is at the |

3 | 0
4 | 00
5 | 0
6 |
7 | 0

> stem(data2, scale = 2)

The decimal point is at the |

3 | 0
3 |
4 | 00
4 |
5 | 0
5 |
6 |
6 |
7 | 0
You can make the scale of the plot wider using the scale = instruction in the command, as shown above with scale = 2.

Histograms
> data2
[1] 4 5 7 3 4
> hist(data2)
> data4= c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4)
> data4
[1] 3 5 7 5 3 2 6 8 5 6 9 4 5 7 3 4
> hist(data4)
> table(data4)
data4
2 3 4 5 6 7 8 9
1 3 2 4 2 2 1 1
The first bar straddles the range 2 to 3 (with the default settings this bin includes both endpoints, so it holds the single 2 and the three 3s), and therefore you should expect four items in this bin.
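To check this without drawing the plot (a small addition, not in the original transcript; the name h is just illustrative), hist() can return its breakpoints and counts:

> h <- hist(data4, plot = FALSE)  # compute the histogram without plotting
> h$breaks                        # the bin boundaries
> h$counts                        # the number of items in each bin

The first element of h$counts should be 4, matching the reasoning above.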
You can alter the number of columns that are displayed using the breaks = instruction as part of the command. This instruction accepts several sorts of input; for example, you can name a standard algorithm for calculating the breakpoints. The default is breaks = "Sturges", which uses the number of observations to decide how many bins to split the range into.
Two other standard algorithms are available: "Scott" and "Freedman-Diaconis". You can use lowercase and unambiguous abbreviations; additionally, you can use "FD" for the last of these three options:
> hist(data4, breaks = 'Sturges')
> hist(data4, breaks = 'Scott')
> hist(data4, breaks = 'FD')
> hist(data4, col='gray75', main=NULL, xlab = 'Size class for data4',
+ ylim=c(0, 0.3), freq = FALSE)

You use the density() command on a vector of numbers to obtain the kernel density estimate for the vector in question. Assign the result to an object and then print it:

> dens <- density(data4)
> dens

Call:
density.default(x = data4)

Data: data4 (16 obs.); Bandwidth 'bw' = 0.9644

x y
Min. :-0.8932 Min. :0.0002982
1st Qu.: 2.3034 1st Qu.:0.0134042
Median : 5.5000 Median :0.0694574
Mean : 5.5000 Mean :0.0781187
3rd Qu.: 8.6966 3rd Qu.:0.1396352
Max. :11.8932 Max. :0.1798531

Syntax:-

density(x, bw = 'nrd0', kernel = 'gaussian', na.rm = FALSE)


You specify your data, which must be a numerical vector, followed by the bandwidth. The bandwidth defaults to the "nrd0" algorithm, but you have several others to choose from, or you can give a numeric value.
The kernel = instruction enables you to select one of several smoothing options, the default being the "gaussian" smoother. You can see the various options in the help entry for this command. By default, NA items are not removed and an error results if they are present; adding na.rm = TRUE strips out any NA items.
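For example (a minimal sketch, not from the original transcript; the name dens2 is just illustrative), you could fix the bandwidth, switch the kernel, and strip NA values in one call:

> dens2 <- density(data4, bw = 0.5, kernel = 'rectangular', na.rm = TRUE)
> plot(dens2)  # plot the resulting density estimate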
> names(dens)
[1] "x" "y" "bw" "n" "call" "data.name"
[7] "has.na"
> str(dens)
List of 7
$ x : num [1:512] -0.893 -0.868 -0.843 -0.818 -0.793 ...
$ y : num [1:512] 0.000313 0.000339 0.000367 0.000397 0.000429 ...
$ bw : num 0.964
$ n : int 16
$ call : language density.default(x = data4)
$ data.name: chr "data4"
$ has.na : logi FALSE
- attr(*, "class")= chr "density"
> plot(dens$x, dens$y)
> plot(density(data4))

Adding Density Lines to Existing Graphs


> hist(data4, freq = F, col = 'gray85')
> lines(density(data4), lty = 2)
> lines(density(data4, kernel = 'rectangular'))

Random Number generation


R has the ability to use a variety of random number-generating algorithms (for more
details, look at help(RNG) to bring up the appropriate help entry). You can alter the
algorithm by using the RNGkind() command.
> RNGkind()
[1] "Mersenne-Twister" "Inversion"
> RNGkind(kind = 'Super', normal.kind = 'Box')
> RNGkind()
[1] "Super-Duper" "Box-Muller"
> RNGkind('default')
> RNGkind()
[1] "Mersenne-Twister" "Box-Muller"
> RNGkind('default', 'default')
> RNGkind()
[1] "Mersenne-Twister" "Inversion"

Random Numbers and Sampling


> data4
[1] 3 5 7 5 3 2 6 8 5 6 9 4 5 7 3 4
> sample(data4, size = 4)
[1] 7 4 5 9
> sample(data4[data4 > 5], size = 3)
[1] 7 8 9
> sample(data4[data4 > 5])
[1] 9 7 6 8 6 7
> data2[data4 > 5]
[1] 7 NA NA NA NA NA
(Indexing data2, which has only five values, with a condition built from the longer data4 produces NA for positions beyond its length.)
> sample(data4[data4 > 5])
[1] 6 6 8 9 7 7
> sample(data4[data4 > 5])
[1] 6 9 7 7 6 8
> sample(data4[data4 > 5])
[1] 6 8 7 7 6 9
> sample(data4[data4 > 5])
[1] 7 7 6 6 9 8
> sample(data4[data4 > 5])
[1] 7 7 6 9 8 6
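By default sample() draws without replacement. Adding replace = TRUE (not used in the transcript above) lets values be drawn more than once, so the sample may even be larger than the source vector:

> sample(data4, size = 4, replace = TRUE)
> sample(data4[data4 > 5], size = 10, replace = TRUE)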

Exploratory analysis:-
Write the following R script and see the output in the console.
Size and Structure of Data

> dim(iris)
[1] 150 5
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species"
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Attributes of Data
> attributes(iris)
$names
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species"

$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
[15] 15 16 17 18 19 20 21 22 23 24 25 26 27 28
[29] 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[43] 43 44 45 46 47 48 49 50 51 52 53 54 55 56
[57] 57 58 59 60 61 62 63 64 65 66 67 68 69 70
[71] 71 72 73 74 75 76 77 78 79 80 81 82 83 84
[85] 85 86 87 88 89 90 91 92 93 94 95 96 97 98
[99] 99 100 101 102 103 104 105 106 107 108 109 110 111 112
[113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140
[141] 141 142 143 144 145 146 147 148 149 150

$class
[1] "data.frame"

Now check the first three rows of the data, and the last three rows with tail():

> iris[1:3, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa

> tail(iris,3)
Sepal.Length Sepal.Width Petal.Length Petal.Width
148 6.5 3.0 5.2 2.0
149 6.2 3.4 5.4 2.3
150 5.9 3.0 5.1 1.8
Species
148 virginica
149 virginica
150 virginica

Now check the first column of the data: the first 10 values of Sepal.Length.

> iris[1:10,"Sepal.Length"]
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

Function summary()
> summary(iris)
Sepal.Length Sepal.Width Petal.Length
Min. :4.300 Min. :2.000 Min. :1.000
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
Median :5.800 Median :3.000 Median :4.350
Mean :5.843 Mean :3.057 Mean :3.758
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
Max. :7.900 Max. :4.400 Max. :6.900
Petal.Width Species
Min. :0.100 setosa :50
1st Qu.:0.300 versicolor:50
Median :1.300 virginica :50
Mean :1.199
3rd Qu.:1.800
Max. :2.500

Mean, Median, Range and Quartiles

> range(iris$Sepal.Length)
[1] 4.3 7.9
> quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
10% 30% 65%
4.80 5.27 6.20
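The mean and median named in the heading come from mean() and median(); these commands are not in the original output, but their values agree with the summary() result above:

> mean(iris$Sepal.Length)
[1] 5.843333
> median(iris$Sepal.Length)
[1] 5.8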

Variance, Histogram, Density Plot, Table, Pie Chart, Bar Chart

> var(iris$Sepal.Length)
[1] 0.6856935
> hist(iris$Sepal.Length)
> plot(density(iris$Sepal.Length))
> table(iris$Species)

    setosa versicolor  virginica 
        50         50         50
> pie(table(iris$Species))
> barplot(table(iris$Species))

Correlation
> cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
> cov(iris$Sepal.Length, iris$Petal.Length)
[1] 1.274315
> cov(iris[, 1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
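In the same way, cor() applied to the four numeric columns gives the full correlation matrix (only the command is shown here; it is not part of the original output):

> cor(iris[, 1:4])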
Aggregation

Stats of Sepal.Length for every Species with aggregate()

> aggregate(Sepal.Length ~ Species, summary, data = iris)


Species Sepal.Length.Min. Sepal.Length.1st Qu. Sepal.Length.Median
1 setosa 4.300 4.800 5.000
2 versicolor 4.900 5.600 5.900
3 virginica 4.900 6.225 6.500
Sepal.Length.Mean Sepal.Length.3rd Qu. Sepal.Length.Max.
1 5.006 5.200 5.800
2 5.936 6.300 7.000
3 6.588 6.900 7.900
Boxplot

The bar in the middle is the median. The box shows the interquartile range (IQR), i.e., the range between the 25th and 75th percentiles.
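The original does not show the command itself; a typical call (a minimal sketch) draws one box of Sepal.Length per species:

> boxplot(Sepal.Length ~ Species, data = iris)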

Scatter Plot
> with(iris, plot(Sepal.Length, Sepal.Width, col = Species,
+ pch = as.numeric(Species)))
Scatter Plot with jitter:- jitter() adds a small amount of noise to the data.
> plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width))
Matrix of Scatter plot
> pairs(iris)
3D Scatter plot
> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
Interactive 3D Scatter plot
> library(rgl)
> plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
Heat Map

Calculate the similarity between different flowers (observations) in the iris data with dist() and then plot it with a heat map.
> dist.matrix <- as.matrix(dist(iris[, 1:4]))
> heatmap(dist.matrix)

Level plot:- Function rainbow() creates a vector of contiguous colors.


> library(lattice)
> levelplot(Petal.Width ~ Sepal.Length * Sepal.Width, iris, cuts = 9,
+ col.regions = rainbow(10)[10:1])

Contour plot:- contour() and filled.contour() in package graphics; contourplot() in package lattice.
> filled.contour(volcano, color = terrain.colors, asp = 1,
+                plot.axes = contour(volcano, add = TRUE))

3D Surface

> persp(volcano, theta = 25, phi = 30, expand = 0.5, col = "lightblue")

Parallel Coordinates
> library(MASS)
> parcoord(iris[1:4], col = iris$Species)

Parallel Coordinates with Package lattice


> library(lattice)
> parallelplot(~iris[1:4] | Species, data = iris)

Visualization with Package ggplot2

> library(ggplot2)
> qplot(Sepal.Length, Sepal.Width, data = iris, facets = Species ~ .)
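The same faceted figure can also be built with the ggplot() interface (an equivalent sketch, not taken from the original document):

> ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
+   geom_point() +
+   facet_grid(Species ~ .)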

Save Charts to Files


Save charts to PDF and PostScript files with pdf() and postscript(); save to BMP, JPEG, PNG, and TIFF files with bmp(), jpeg(), png(), and tiff(). Close the files (or graphics devices) with graphics.off() or dev.off() after plotting.
pdf("myPlot.pdf")
x <- 1:50
plot(x, log(x))
graphics.off()
# Save as a postscript file
postscript("myPlot2.ps")
x <- -20:20
plot(x, x^2)
graphics.off()

References:- http://www.RDataMining.com/docs

*******************************THE END************************************
