Using R For Data Preprocessing, Exploratory Analysis, Visualization
Data Distribution
You usually want to know whether the observations are clustered around some middle point (the average) and whether
any observations lie far from the rest (outliers). Both questions concern the distribution of the data. There are many
distributions, but common ones are the normal, Poisson, and binomial distributions. Other distributions relate
directly to statistical tests; for example, the chi-squared distribution.
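As a rough illustration of these ideas, the sketch below draws simulated samples from the three distributions named above and shows how one extreme value affects the mean and the median. The sample sizes, parameters, seed, and the hand-added outlier are all assumptions chosen for the example; none of them come from the data used in this handout.

```r
# Assumed illustration: simulated samples, not data from this handout.
set.seed(42)                                   # make the random draws repeatable
normal_sample   <- rnorm(200, mean = 10, sd = 2)
poisson_sample  <- rpois(200, lambda = 3)
binomial_sample <- rbinom(200, size = 10, prob = 0.5)

# A single extreme value added by hand to act as an outlier
with_outlier <- c(normal_sample, 40)

# The mean is pulled towards the outlier more than the median is
c(mean = mean(normal_sample), median = median(normal_sample))
c(mean = mean(with_outlier),  median = median(with_outlier))

# boxplot.stats() lists points lying more than 1.5 box lengths beyond the hinges
boxplot.stats(with_outlier)$out
```

The median barely moves when the extreme value is added, which is one reason it is often reported alongside the mean when outliers are suspected.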
The stem() command redraws the data in such a way that you can see the range of numeric categories
on the left and a representation of the frequency on the right.
> data2
[1] 4 5 7 3 4
> table(data2)
data2
3 4 5 7
1 2 1 1
> sort(data2)
[1] 3 4 4 5 7
> stem(data2)
3 | 0
4 | 00
5 | 0
6 |
7 | 0
You can make the scale of the axis wider using the scale = instruction in the command:
> stem(data2, scale = 2)
3 | 0
3 |
4 | 00
4 |
5 | 0
5 |
6 |
6 |
7 | 0
Histograms
> data2
[1] 4 5 7 3 4
> hist(data2)
> data4 = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4)
> data4
[1] 3 5 7 5 3 2 6 8 5 6 9 4 5 7 3 4
> hist(data4)
> table(data4)
data4
2 3 4 5 6 7 8 9
1 3 2 4 2 2 1 1
The first bar covers the range 2 to 3. By default each bin includes its upper limit, and the lowest value is also
counted in the first bin, so this bar holds the single 2 and the three 3s; you should therefore expect four items in it.
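The bin counts can be checked without drawing anything. The sketch below is a minimal example that assumes breakpoints at the whole numbers 2 to 9, passed explicitly so the result does not depend on the default algorithm; the object name h is arbitrary.

```r
data4 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4)

# Breaks are given explicitly here (an assumption for the sketch) so the
# bin edges do not depend on the default algorithm
h <- hist(data4, breaks = 2:9, plot = FALSE)

h$breaks   # bin edges
h$counts   # number of observations falling in each bin

# right = TRUE (the default) closes each bin on the right: (a, b]
# include.lowest = TRUE (the default) also puts the minimum, 2, in the first bin
sum(data4 >= 2 & data4 <= 3)   # matches h$counts[1]
```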
You can alter the number of columns that are displayed using the breaks = instruction as part of the command. This
instruction will accept several sorts of input; you can use a standard algorithm for calculating the breakpoints,
for example. The default is breaks = "Sturges", which derives the number of bins from the number of observations
and spreads them over the range of the data. Two other standard algorithms are available: "Scott" and
"Freedman-Diaconis". Case is ignored and you can use any unambiguous abbreviation; additionally you can use "FD"
for the last of these three options:
> hist(data4, breaks = 'Sturges')
> hist(data4, breaks = 'Scott')
> hist(data4, breaks = 'FD')
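To compare the three rules on data4 without reading the plots, the sketch below calls the helper functions that turn each rule into a suggested bin count. These helpers (nclass.Sturges, nclass.scott, nclass.FD) ship with base R; the choice to inspect the breakpoints with plot = FALSE is just a convenience for the example.

```r
data4 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4)

# Suggested number of bins under each rule (all three are in grDevices)
nclass.Sturges(data4)   # ceiling(log2(n) + 1)
nclass.scott(data4)     # bin width based on the standard deviation
nclass.FD(data4)        # bin width based on the interquartile range

# hist() passes the suggestion through pretty(), so the breakpoints
# actually used may differ from the suggested count
hist(data4, breaks = "Scott", plot = FALSE)$breaks
```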
> hist(data4, col='gray75', main=NULL, xlab = 'Size class for data4',
+ ylim=c(0, 0.3), freq = FALSE)
You use the density() command on a vector of numbers to obtain the kernel density estimate for the vector in
question.
> dens = density(data4)
> dens
Call:
density.default(x = data4)
       x                 y
 Min.   :-0.8932   Min.   :0.0002982
 1st Qu.: 2.3034   1st Qu.:0.0134042
 Median : 5.5000   Median :0.0694574
 Mean   : 5.5000   Mean   :0.0781187
 3rd Qu.: 8.6966   3rd Qu.:0.1396352
 Max.   :11.8932   Max.   :0.1798531
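One common use of the density estimate is to draw it over a histogram of the same data. The sketch below is a minimal example of that idea; it reuses the histogram settings shown earlier, and the line width is an arbitrary choice.

```r
data4 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4)

dens <- density(data4)   # Gaussian kernel, default bandwidth rule
dens$bw                  # bandwidth chosen for the estimate

# freq = FALSE puts the histogram on the density scale, so the curve
# and the bars share the same y axis
hist(data4, freq = FALSE, col = "gray75", main = NULL,
     xlab = "Size class for data4", ylim = c(0, 0.3))
lines(dens, lwd = 2)
```

With freq = TRUE the bars would show counts instead, and the density curve would sit almost flat along the bottom of the plot.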
Syntax:-
Exploratory analysis:-
Write the following R script and observe the output in the console.
Size and Structure of Data
> dim(iris)
[1] 150 5
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species"
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Attributes of Data
> attributes(iris)
$names
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
[15] 15 16 17 18 19 20 21 22 23 24 25 26 27 28
[29] 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[43] 43 44 45 46 47 48 49 50 51 52 53 54 55 56
[57] 57 58 59 60 61 62 63 64 65 66 67 68 69 70
[71] 71 72 73 74 75 76 77 78 79 80 81 82 83 84
[85] 85 86 87 88 89 90 91 92 93 94 95 96 97 98
[99] 99 100 101 102 103 104 105 106 107 108 109 110 111 112
[113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140
[141] 141 142 143 144 145 146 147 148 149 150
$class
[1] "data.frame"
> iris[1:3, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> head(iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> tail(iris,3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
> iris[1:10,"Sepal.Length"]
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
Function summary()
> summary(iris)
Sepal.Length Sepal.Width Petal.Length
Min. :4.300 Min. :2.000 Min. :1.000
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
Median :5.800 Median :3.000 Median :4.350
Mean :5.843 Mean :3.057 Mean :3.758
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
Max. :7.900 Max. :4.400 Max. :6.900
Petal.Width Species
Min. :0.100 setosa :50
1st Qu.:0.300 versicolor:50
Median :1.300 virginica :50
Mean :1.199
3rd Qu.:1.800
Max. :2.500
> range(iris$Sepal.Length)
[1] 4.3 7.9
> quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
10% 30% 65%
4.80 5.27 6.20
> var(iris$Sepal.Length)
[1] 0.6856935
> hist(iris$Sepal.Length)
> plot(density(iris$Sepal.Length))
> table(iris$Species)
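The frequency table returned by table() can be turned into proportions or drawn directly. The sketch below shows one way to do this; the object name species_counts and the axis label are assumptions made for the example.

```r
# Counts per species; iris ships with base R (package datasets)
species_counts <- table(iris$Species)
species_counts

prop.table(species_counts)   # the same counts as proportions

# Two common ways to draw a frequency table
barplot(species_counts, ylab = "Number of flowers")
pie(species_counts)
```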
Correlation
> cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
> cov(iris$Sepal.Length, iris$Petal.Length)
[1] 1.274315
> cov(iris[, 1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
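Covariance and correlation are linked: the correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this for one pair of columns and then rescales the whole covariance matrix; the names num and manual are arbitrary.

```r
num <- iris[, 1:4]                 # the four numeric columns

# Correlation = covariance / (sd of x * sd of y)
manual <- cov(num$Sepal.Length, num$Petal.Length) /
  (sd(num$Sepal.Length) * sd(num$Petal.Length))
manual
cor(num$Sepal.Length, num$Petal.Length)

# The same rescaling applied to the whole covariance matrix
cov2cor(cov(num))
```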
Aggregation
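No command is listed under this heading. As one possible illustration, and only as an assumption about what might be intended, the sketch below uses the base R function aggregate() to summarise sepal length by species.

```r
# Mean sepal length for each species (formula interface of aggregate())
mean_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
mean_by_species

# Several columns at once: cbind() on the left-hand side of the formula
aggregate(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris, FUN = median)
```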
Scatter Plot
> with(iris, plot(Sepal.Length, Sepal.Width, col = Species,
+ pch = as.numeric(Species)))
Scatter Plot with Jitter
jitter() adds a small amount of noise to the data, which separates points that would otherwise overlap.
> plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width))
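To see why jitter helps here, the sketch below counts how many points share exactly the same coordinates and how far jitter() moves each value. The seed and the factor = 2 setting are assumptions chosen for the example.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Measurements are recorded to one decimal place, so many (x, y) pairs coincide
sum(duplicated(data.frame(x, y)))

set.seed(1)                 # jitter() is random; fix the seed for repeatability
xj <- jitter(x)
max(abs(xj - x))            # each value moves by only a small amount

# factor = scales the amount of noise
plot(jitter(x, factor = 2), jitter(y, factor = 2),
     xlab = "Sepal.Length", ylab = "Sepal.Width")
```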
Matrix of Scatter Plots
> pairs(iris)
3D Scatter plot
> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
Interactive 3D Scatter plot
> library(rgl)
> plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
Heat Map
Calculate the distance (dissimilarity) between different flowers in the iris data with dist() and then plot it
with a heat map
> dist.matrix <- as.matrix(dist(iris[, 1:4]))
> heatmap(dist.matrix)
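Before plotting, it can help to confirm what the distance matrix contains. The sketch below checks its shape and recomputes one entry by hand; the names d and by_hand are arbitrary, and the Euclidean metric is simply the default of dist().

```r
d <- as.matrix(dist(iris[, 1:4]))   # Euclidean distance by default

dim(d)                  # one row and one column per flower
isSymmetric(d)          # distance from i to j equals distance from j to i
all(diag(d) == 0)       # every flower is at distance 0 from itself

# Distance between flowers 1 and 2, computed by hand
by_hand <- sqrt(sum((unlist(iris[1, 1:4]) - unlist(iris[2, 1:4]))^2))
c(from_dist = d[1, 2], by_hand = by_hand)
```

Small values in the matrix mean similar flowers, so the blocks that heatmap() groups together correspond to flowers with similar measurements.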
Contour Plot:- contour() and filled.contour() in package graphics; contourplot() in package lattice.
The lattice function levelplot() draws the related colour-coded level plot.
> library(lattice)
> levelplot(Petal.Width ~ Sepal.Length * Sepal.Width, iris, cuts = 9,
+ col.regions = rainbow(10)[10:1])
> filled.contour(volcano, color.palette = terrain.colors, asp = 1,
+ plot.axes = contour(volcano, add = TRUE))
3D Surface
> persp(volcano, theta = 25, phi = 30, expand = 0.5, col = "lightblue")
Parallel Coordinates
> library(MASS)
> parcoord(iris[1:4], col = iris$Species)
Faceted Scatter Plot with ggplot2
> library(ggplot2)
> qplot(Sepal.Length, Sepal.Width, data = iris, facets = Species ~ .)
References:- http://www.RDataMining.com/docs