Assignment-2

Using R for data preprocessing, exploratory analysis, visualization.

Data Distribution

You usually want to know whether the observations are clustered around some middle point (the average) and whether there are observations far away on their own (outliers). This is all related to the distribution of the data. There are many distributions, but common ones are the normal, Poisson, and binomial distributions. There are also distributions relating directly to statistical tests, such as the chi-squared distribution.
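As a quick illustration (not part of the original text), each of these distributions has a random-number function in base R; the exact values printed will differ on every run:

> rnorm(5, mean = 0, sd = 1)        # normal distribution
> rpois(5, lambda = 3)              # Poisson distribution
> rbinom(5, size = 10, prob = 0.5)  # binomial distribution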

Make a Stem and Leaf Plot

The stem() command redraws the data so that you can see the numeric categories (the stems) on the left and a representation of the frequency (the leaves) on the right.

> data2
[1] 4 5 7 3 4
> table(data2)
data2
3 4 5 7
1 2 1 1
> sort(data2)
[1] 3 4 4 5 7
> stem(data2)

The decimal point is at the |

3 | 0
4 | 00
5 | 0
6 |
7 | 0

> stem(data2, scale = 2)

The decimal point is at the |

3 | 0
3 |
4 | 00
4 |
5 | 0
5 |
6 |
6 |
7 | 0
You can make the scale of the plot wider using the scale = instruction in the command, as shown above with scale = 2.

Histograms
> data2
[1] 4 5 7 3 4
> hist(data2)
> data4= c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4)
> data4
[1] 3 5 7 5 3 2 6 8 5 6 9 4 5 7 3 4
> hist(data4)
> table(data4)
data4
2 3 4 5 6 7 8 9
1 3 2 4 2 2 1 1
The first bar straddles the range 2 to 3 (with the default settings this bin includes both endpoints, so it holds the single 2 and the three 3s), and therefore you should expect four items in this bin.
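To check this without drawing the plot (a small addition, not in the original transcript; the name h is just illustrative), hist() can return its breakpoints and counts:

> h <- hist(data4, plot = FALSE)  # compute the histogram without plotting
> h$breaks                        # the bin boundaries
> h$counts                        # the number of items in each bin

The first element of h$counts should be 4, matching the reasoning above.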
You can alter the number of columns that are displayed using the breaks = instruction as part of the command. This instruction accepts several sorts of input; for example, you can name a standard algorithm for calculating the breakpoints. The default is breaks = "Sturges", which uses the number of observations to decide how many bins to split the range into.
Two other standard algorithms are available: "Scott" and "Freedman-Diaconis". You can use lowercase and unambiguous abbreviations; additionally, you can use "FD" for the last of these three options:
> hist(data4, breaks = 'Sturges')
> hist(data4, breaks = 'Scott')
> hist(data4, breaks = 'FD')
> hist(data4, col='gray75', main=NULL, xlab = 'Size class for data4',
+ ylim=c(0, 0.3), freq = FALSE)

You use the density() command on a vector of numbers to obtain the kernel density estimate for the vector in question. Assign the result to an object and then print it:

> dens <- density(data4)
> dens

Call:
density.default(x = data4)

Data: data4 (16 obs.); Bandwidth 'bw' = 0.9644

x y
Min. :-0.8932 Min. :0.0002982
1st Qu.: 2.3034 1st Qu.:0.0134042
Median : 5.5000 Median :0.0694574
Mean : 5.5000 Mean :0.0781187
3rd Qu.: 8.6966 3rd Qu.:0.1396352
Max. :11.8932 Max. :0.1798531

Syntax:-

density(x, bw = 'nrd0', kernel = 'gaussian', na.rm = FALSE)


You specify your data, which must be a numerical vector, followed by the bandwidth. The bandwidth defaults to the "nrd0" algorithm, but you have several others to choose from, or you can give a numeric value.
The kernel = instruction enables you to select one of several smoothing options, the default being the "gaussian" smoother. You can see the various options in the help entry for this command. By default, NA items are not removed and an error results if they are present; adding na.rm = TRUE strips out any NA items.
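For example (a minimal sketch, not from the original transcript; the name dens2 is just illustrative), you could fix the bandwidth, switch the kernel, and strip NA values in one call:

> dens2 <- density(data4, bw = 0.5, kernel = 'rectangular', na.rm = TRUE)
> plot(dens2)  # plot the resulting density estimate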
> names(dens)
[1] "x" "y" "bw" "n" "call" "data.name"
[7] "has.na"
> str(dens)
List of 7
$ x : num [1:512] -0.893 -0.868 -0.843 -0.818 -0.793 ...
$ y : num [1:512] 0.000313 0.000339 0.000367 0.000397 0.000429 ...
$ bw : num 0.964
$ n : int 16
$ call : language density.default(x = data4)
$ data.name: chr "data4"
$ has.na : logi FALSE
- attr(*, "class")= chr "density"
> plot(dens$x, dens$y)
> plot(density(data4))

Adding Density Lines to Existing Graphs


> hist(data4, freq = F, col = 'gray85')
> lines(density(data4), lty = 2)
> lines(density(data4, kernel = 'rectangular'))

Random Number generation


R has the ability to use a variety of random number-generating algorithms (for more
details, look at help(RNG) to bring up the appropriate help entry). You can alter the
algorithm by using the RNGkind() command.
> RNGkind()
[1] "Mersenne-Twister" "Inversion"
> RNGkind(kind = 'Super', normal.kind = 'Box')
> RNGkind()
[1] "Super-Duper" "Box-Muller"
> RNGkind('default')
> RNGkind()
[1] "Mersenne-Twister" "Box-Muller"
> RNGkind('default', 'default')
> RNGkind()
[1] "Mersenne-Twister" "Inversion"

Random Numbers and Sampling


> data4
[1] 3 5 7 5 3 2 6 8 5 6 9 4 5 7 3 4
> sample(data4, size = 4)
[1] 7 4 5 9
> sample(data4[data4 > 5], size = 3)
[1] 7 8 9
> sample(data4[data4 > 5])
[1] 9 7 6 8 6 7
> data2[data4 > 5]
[1] 7 NA NA NA NA NA
(Indexing data2, which has only five values, with a condition built from the longer data4 produces NA for positions beyond its length.)
> sample(data4[data4 > 5])
[1] 6 6 8 9 7 7
> sample(data4[data4 > 5])
[1] 6 9 7 7 6 8
> sample(data4[data4 > 5])
[1] 6 8 7 7 6 9
> sample(data4[data4 > 5])
[1] 7 7 6 6 9 8
> sample(data4[data4 > 5])
[1] 7 7 6 9 8 6
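By default sample() draws without replacement. Adding replace = TRUE (not used in the transcript above) lets values be drawn more than once, so the sample may even be larger than the source vector:

> sample(data4, size = 4, replace = TRUE)
> sample(data4[data4 > 5], size = 10, replace = TRUE)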

Exploratory analysis:-
Write the following R script and see the output in the console.
Size and Structure of Data

> dim(iris)
[1] 150 5
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species"
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Attributes of Data
> attributes(iris)
$names
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species"

$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
[15] 15 16 17 18 19 20 21 22 23 24 25 26 27 28
[29] 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[43] 43 44 45 46 47 48 49 50 51 52 53 54 55 56
[57] 57 58 59 60 61 62 63 64 65 66 67 68 69 70
[71] 71 72 73 74 75 76 77 78 79 80 81 82 83 84
[85] 85 86 87 88 89 90 91 92 93 94 95 96 97 98
[99] 99 100 101 102 103 104 105 106 107 108 109 110 111 112
[113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140
[141] 141 142 143 144 145 146 147 148 149 150

$class
[1] "data.frame"

Now check the first three rows of the data, and the last three rows with tail():

> iris[1:3, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa

> tail(iris,3)
Sepal.Length Sepal.Width Petal.Length Petal.Width
148 6.5 3.0 5.2 2.0
149 6.2 3.4 5.4 2.3
150 5.9 3.0 5.1 1.8
Species
148 virginica
149 virginica
150 virginica

Now check the first column of the data: the first 10 values of Sepal.Length.

> iris[1:10,"Sepal.Length"]
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

Function summary()
> summary(iris)
Sepal.Length Sepal.Width Petal.Length
Min. :4.300 Min. :2.000 Min. :1.000
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
Median :5.800 Median :3.000 Median :4.350
Mean :5.843 Mean :3.057 Mean :3.758
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
Max. :7.900 Max. :4.400 Max. :6.900
Petal.Width Species
Min. :0.100 setosa :50
1st Qu.:0.300 versicolor:50
Median :1.300 virginica :50
Mean :1.199
3rd Qu.:1.800
Max. :2.500

Mean, Median, Range and Quartiles

> range(iris$Sepal.Length)
[1] 4.3 7.9
> quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
10% 30% 65%
4.80 5.27 6.20
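The mean and median named in the heading come from mean() and median(); these commands are not in the original output, but their values agree with the summary() result above:

> mean(iris$Sepal.Length)
[1] 5.843333
> median(iris$Sepal.Length)
[1] 5.8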

Variance, Histogram, Density Plot, Table, Pie Chart, Bar Chart

> var(iris$Sepal.Length)
[1] 0.6856935
> hist(iris$Sepal.Length)
> plot(density(iris$Sepal.Length))
> table(iris$Species)

    setosa versicolor  virginica 
        50         50         50
> pie(table(iris$Species))
> barplot(table(iris$Species))

Correlation
> cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
> cov(iris$Sepal.Length, iris$Petal.Length)
[1] 1.274315
> cov(iris[, 1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
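In the same way, cor() applied to the four numeric columns gives the full correlation matrix (only the command is shown here; it is not part of the original output):

> cor(iris[, 1:4])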
Aggregation

Stats of Sepal.Length for every Species with aggregate()

> aggregate(Sepal.Length ~ Species, summary, data = iris)


Species Sepal.Length.Min. Sepal.Length.1st Qu. Sepal.Length.Median
1 setosa 4.300 4.800 5.000
2 versicolor 4.900 5.600 5.900
3 virginica 4.900 6.225 6.500
Sepal.Length.Mean Sepal.Length.3rd Qu. Sepal.Length.Max.
1 5.006 5.200 5.800
2 5.936 6.300 7.000
3 6.588 6.900 7.900
Boxplot

The bar in the middle is the median. The box shows the interquartile range (IQR), i.e., the range between the 25th and 75th percentiles.
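The original does not show the command itself; a typical call (a minimal sketch) draws one box of Sepal.Length per species:

> boxplot(Sepal.Length ~ Species, data = iris)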

Scatter Plot
> with(iris, plot(Sepal.Length, Sepal.Width, col = Species,
+ pch = as.numeric(Species)))
Scatter Plot with jitter:- jitter() adds a small amount of noise to the data.
> plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width))
Matrix of Scatter plot
> pairs(iris)
3D Scatter plot
> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
Interactive 3D Scatter plot
> library(rgl)
> plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
Heat Map

Calculate the similarity between different flowers (observations) in the iris data with dist() and then plot it with a heat map.
> dist.matrix <- as.matrix(dist(iris[, 1:4]))
> heatmap(dist.matrix)

Level plot:- Function rainbow() creates a vector of contiguous colors.


> library(lattice)
> levelplot(Petal.Width ~ Sepal.Length * Sepal.Width, iris, cuts = 9,
+ col.regions = rainbow(10)[10:1])

Contour plot:- contour() and filled.contour() in package graphics; contourplot() in package lattice.
> filled.contour(volcano, color = terrain.colors, asp = 1,
+                plot.axes = contour(volcano, add = TRUE))

3D Surface

> persp(volcano, theta = 25, phi = 30, expand = 0.5, col = "lightblue")

Parallel Coordinates
> library(MASS)
> parcoord(iris[1:4], col = iris$Species)

Parallel Coordinates with Package lattice


> library(lattice)
> parallelplot(~iris[1:4] | Species, data = iris)

Visualization with Package ggplot2

> library(ggplot2)
> qplot(Sepal.Length, Sepal.Width, data = iris, facets = Species ~ .)
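The same faceted figure can also be built with the ggplot() interface (an equivalent sketch, not taken from the original document):

> ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
+   geom_point() +
+   facet_grid(Species ~ .)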

Save Charts to Files


Save charts to PDF and PostScript files with pdf() and postscript(); save to BMP, JPEG, PNG, and TIFF files with bmp(), jpeg(), png(), and tiff(). Close the files (or graphics devices) with graphics.off() or dev.off() after plotting.
pdf("myPlot.pdf")
x <- 1:50
plot(x, log(x))
graphics.off()
# Save as a postscript file
postscript("myPlot2.ps")
x <- -20:20
plot(x, x^2)
graphics.off()

References:- http://www.RDataMining.com/docs

*******************************THE END************************************
