Data Analytics and R Assignment I

DATA ANALYTICS AND R

ASSIGNMENT I

CALCULATE STATISTICAL INDICATORS OF THE DATASET
AND PLOT GRAPHS USING THE GEOM FUNCTIONS

SUBMITTED TO:
MR. SUBHANKAR MISHRA
ASST. PROF
NIFT BHUBANESWAR

SUBMITTED BY:
VAISISTHA BAL
BFT/17/470
DEPT. OF FASHION TECHNOLOGY
NIFT BHUBANESWAR

NATIONAL INSTITUTE OF FASHION TECHNOLOGY 1


INTRODUCTION TO R
R is a language and environment for statistical computing and graphics. It provides a
wide variety of statistical techniques (such as linear and nonlinear modelling) and
graphical techniques. R is an integrated suite of software facilities for data
manipulation, calculation and graphical display. Many think of R as a statistics system,
but it is better described as an environment within which statistical techniques can be
implemented.
SELECTED DATA SET
The data set selected for the purpose of this assignment is Orange.
About the dataset
To learn about a data set in R, we can write: ?<name of the dataset>
Hence, typing ?Orange in the R console or in an R script shows the information
about the dataset.
The Orange data frame has 35 rows and 3 columns of records of the growth of orange
trees. It has the following columns:
• Tree: an ordered factor indicating the tree on which the measurement is made.
The ordering is according to increasing maximum diameter.
• age: a numeric vector giving the age of the tree (days since 1968/12/31).
• circumference: a numeric vector of trunk circumferences (mm). This is probably
“circumference at breast height”, a standard measurement in forestry.
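The description above can be checked directly in the console. The following is a minimal sketch that uses only base R functions on the built-in Orange data frame; nothing in it is specific to this assignment.

```r
# Inspect the built-in Orange data frame (datasets package, attached by default).
dims <- dim(Orange)        # number of rows and columns
cols <- names(Orange)      # column names exactly as R spells them
print(dims)
print(cols)
str(Orange)                # type of each column
print(levels(Orange$Tree)) # tree labels, in the order of the factor
print(head(Orange, 3))     # first three records
```

Note that the column names are case-sensitive: Tree starts with a capital letter, while age and circumference do not.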
STEP I: STATISTICAL FUNCTIONS
There are various statistical functions which can be carried out in R, such as mean,
standard deviation, variance, etc.
• To list the datasets available in R and display the selected one:
library(help = "datasets")
data()
Orange
• To select the required column on which to perform the statistical functions:
Orange['circumference']
• Assigning the first ten values of the “circumference” column (column 3) of the
Orange dataset to a variable v:
v <- Orange[1:10, 3]
• To display a frequency table of the values in the variable v:
table(v)

• To calculate Mean:

mean(v, trim = 1/10)

RESULT: 91.875
INTERPRETATION: The trim argument removes a fraction of the observations (here 10%,
i.e. one value out of ten) from each end of the sorted data before the mean is computed.
The trimmed mean of the selected range “v” of the column “circumference” is 91.875 mm,
which is the average of the eight values that remain. The untrimmed mean, mean(v), is 91.
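To make the effect of trim concrete, the sketch below reproduces the trimmed mean by hand. It assumes the same ten values as v; the names n_drop, kept and manual_trimmed are illustrative only.

```r
v <- Orange$circumference[1:10]    # same ten values as Orange[1:10, 3]

# trim = 1/10 drops floor(10 * 0.1) = 1 observation from each end
# of the sorted data before averaging.
n_drop <- floor(length(v) * 0.1)
kept <- sort(v)[(n_drop + 1):(length(v) - n_drop)]

manual_trimmed <- mean(kept)       # average of the 8 remaining values
trimmed <- mean(v, trim = 1/10)    # built-in trimmed mean
plain_mean <- mean(v)              # ordinary mean of all 10 values

print(c(manual = manual_trimmed, trimmed = trimmed, plain = plain_mean))
```

The trimmed and ordinary means differ only slightly here because the smallest and largest values (30 and 145) lie at similar distances from the centre.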
• To calculate standard deviation:
sd(v)
RESULT: 42.17424
INTERPRETATION: The standard deviation measures how far the data points typically
lie from the mean of the values.
• To sort:
sort(v)
RESULT: 30 33 58 69 87 111 115 120 142 145
INTERPRETATION: sorts the values in the selected range in ascending order.
• To calculate inter quartile range:
IQR(v)
RESULT: 58
INTERPRETATION: IQR() calculates the interquartile range of the data stored in v, i.e.
the difference between the 75% and 25% quantiles (118.75 − 60.75 = 58). It sets aside
the lowest 25% and the highest 25% of the ordered values and shows how spread out the
middle half of the data is. About half of the observations fall within this range.
• To calculate quantile:
quantile(v)
RESULT:
    0%    25%    50%    75%   100%
 30.00  60.75  99.00 118.75 145.00
INTERPRETATION: The quantiles summarise the data from lowest to highest. 0% is the
lowest value, 30, and 100% is the highest value, 145. 50% is the median, 99. 25% is the
first quartile, 60.75, below which a quarter of the values lie, and 75% is the third
quartile, 118.75, below which three quarters of the values lie.
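The quartiles are not always actual data values because quantile() interpolates between neighbouring observations. The sketch below follows R's default rule (type 7); quantile_by_hand is a hypothetical helper written for illustration, not a base R function.

```r
v <- Orange$circumference[1:10]    # same ten values as Orange[1:10, 3]
s <- sort(v)

# Default rule of quantile() (type 7): take position h = 1 + (n - 1) * p in the
# sorted data and interpolate linearly between the two neighbouring values.
quantile_by_hand <- function(s, p) {
  h <- 1 + (length(s) - 1) * p
  lo <- floor(h)
  hi <- ceiling(h)
  s[lo] + (h - lo) * (s[hi] - s[lo])
}

q1 <- quantile_by_hand(s, 0.25)    # first quartile
q3 <- quantile_by_hand(s, 0.75)    # third quartile
iqr_by_hand <- q3 - q1             # interquartile range

print(c(q1 = q1, q3 = q3, iqr = iqr_by_hand))
```

For the first quartile the position is 3.25, so the result lies a quarter of the way from the third sorted value (58) to the fourth (69).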
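• To calculate median:
median(v)
RESULT: 99
INTERPRETATION: The middle value of the data range is 99.
• To calculate the median absolute deviation (mad):
mad(v)
RESULT: 52.6323
INTERPRETATION: It is the median of the absolute differences between the values and
their median, multiplied by a constant (1.4826 by default). It is a measure of spread
that is less affected by extreme values than the standard deviation.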
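The value returned by mad() can be reproduced step by step. This is a minimal sketch that assumes the default settings of mad() (centre at the median, scaling constant 1.4826); abs_dev and mad_by_hand are illustrative names.

```r
v <- Orange$circumference[1:10]     # same ten values as Orange[1:10, 3]

abs_dev <- abs(v - median(v))       # distance of each value from the median
mad_by_hand <- 1.4826 * median(abs_dev)

print(mad_by_hand)
print(mad(v))                       # built-in result for comparison
```

The constant 1.4826 makes the result comparable to a standard deviation when the data are roughly normally distributed.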
• To display stem:
stem(v)
RESULT:
  The decimal point is 2 digit(s) to the right of the |

  0 | 33
  0 | 679
  1 | 1224
  1 | 5

INTERPRETATION: A stem-and-leaf plot is a text display used to understand the
distribution of the data. Here each stem is the hundreds digit and each leaf is the
(rounded) tens digit of one value.
• To find out variance:
var(v)
RESULT: 1778.667
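The variance and the standard deviation reported earlier are two views of the same quantity. The sketch below computes the sample variance from its definition (dividing by n − 1) and takes its square root; var_by_hand and sd_by_hand are illustrative names.

```r
v <- Orange$circumference[1:10]     # same ten values as Orange[1:10, 3]

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
var_by_hand <- sum((v - mean(v))^2) / (length(v) - 1)
sd_by_hand <- sqrt(var_by_hand)     # standard deviation is its square root

print(c(variance = var_by_hand, sd = sd_by_hand))
```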
• To find out the maximum:
max(v)
RESULT: 145
INTERPRETATION: In the specified data range, max shows the highest value, i.e.
145.
• To find out the minimum:
min(v)
RESULT: 30
INTERPRETATION: In the specified data range, min shows the lowest value, i.e.
30.
BASIC GRAPHS
• barplot(table(v))

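INTERPRETATION: Each distinct circumference value is plotted on the x axis and the
number of times it occurs on the y axis. All ten values in v are distinct, so every bar
has height 1.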


• hist(v)

INTERPRETATION: The histogram shows how the circumference values in the specified
data range are distributed across intervals; the tallest bar marks the interval into which
the most values fall.
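The intervals and counts behind the histogram can be read off without drawing it. The following sketch assumes the default settings of hist(); bins is an illustrative name.

```r
v <- Orange$circumference[1:10]     # same ten values as Orange[1:10, 3]

h <- hist(v, plot = FALSE)          # compute the bins without plotting

# One row per bar: interval limits and number of values in the interval.
bins <- data.frame(from = head(h$breaks, -1),
                   to = h$breaks[-1],
                   count = h$counts)
print(bins)
print(bins[which.max(bins$count), ]) # interval holding the most values
```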
• rug(v)

INTERPRETATION: rug adds one tick mark per individual value along the axis of the
current plot (for example the histogram above), showing where the values are concentrated.
• pie(table(v))

• Rainbow Chart
n <- 200  # number of slices; a separate name keeps the data in v intact
pie(rep(1, n), labels = "", col = rainbow(n),
    border = NA,
    main = "pie(*, labels=\"\", col=rainbow(n), border=NA,..")

STEP II: GEOM FUNCTIONS
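• Line plot
library(ggplot2)  # needed for all the geom functions below
# age on the x axis and circumference on the y axis, matching the axis labels
ggplot(Orange, aes(x=age, y=circumference)) +
  geom_smooth(method="loess", se=F) +
  labs(subtitle="Age vs Circumference",
       y="Circumference",
       x="Age",
       title="Line Plot")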
RESULT:

INFERENCE: The line plot shows the circumference of the tree trunk increasing with the
age of the tree. It can be observed visually that circumference increases at an
approximately constant rate.
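The visual impression of a roughly constant rate of increase can be checked numerically. The sketch below fits a straight line with lm(), which is a different technique from the loess smoother used in the plot and is used here only as a rough check; fit, slope and r2 are illustrative names.

```r
# Straight-line fit of circumference on age across all 35 records.
fit <- lm(circumference ~ age, data = as.data.frame(Orange))

slope <- unname(coef(fit)["age"])   # average increase in mm per day of age
r2 <- summary(fit)$r.squared        # share of variation explained by the line

print(coef(fit))
print(r2)
```

A positive slope with an R-squared close to 1 would support the reading of a steady increase; a lower value would indicate that the trees differ in how fast they grow.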
• Scatter Plot
ggplot(Orange, aes(x=age, y=circumference)) +
geom_point(aes(col=Tree)) +
labs(subtitle="Age vs Circumference",
y="Circumference",
x="Age",
title="Scatterplot")

RESULT:

INFERENCE: Scatter plot of age vs circumference, with the points coloured by tree.
For example, tree 4 is thin when young but becomes much thicker as it grows older,
whereas tree 3 grows more gradually. Most of the trees have a similar circumference
(thickness) when young but grow at quite different rates as they get older.
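The comparison between trees can also be read from a small table. This sketch lays the data out with one row per tree and one column per age, then compares the first and last measurements; wide and growth are illustrative names.

```r
# One row per tree, one column per age; each cell holds a single measurement.
wide <- xtabs(circumference ~ Tree + age, data = as.data.frame(Orange))

first_last <- wide[, c("118", "1582")]       # youngest and oldest age
growth <- wide[, "1582"] - wide[, "118"]     # total increase per tree (mm)

print(first_last)
print(growth)
```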
• Counts Plot
ggplot(Orange, aes(age, circumference)) +
geom_count(col="tomato3", show.legend=F) +
labs(subtitle="age vs circumference",
y="Circumference",
x="Age",
title="Counts Plot")
RESULT:

INFERENCE: It works in the same way as a scatter plot, but where values or points
overlap, the area of the circle increases. The larger circle at the start of the plot shows
that the trees have a similar circumference at a younger age.
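Which points overlap can be found by counting identical (age, circumference) pairs, which is what determines the circle size in a counts plot. A minimal sketch in base R, with illustrative names pair_counts and overlaps:

```r
# Count how many records share each (age, circumference) combination.
pair_counts <- as.data.frame(table(age = Orange$age,
                                   circumference = Orange$circumference))

# Keep only the combinations that occur more than once.
overlaps <- pair_counts[pair_counts$Freq > 1, ]
print(overlaps)
```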
• Lollipop Plot
avg_circum <- aggregate(circumference ~ Tree, Orange, mean)
lolp <- ggplot(avg_circum, aes(x=Tree, y=circumference)) +
geom_point(size=3) +
geom_segment(aes(x=Tree, xend=Tree, y=0, yend=circumference)) +
labs(title="Lollipop Chart of Mean Circumference per Tree",
subtitle="Tree vs Circumference")
plot(lolp)
RESULT:

INFERENCE: This chart describes the average circumference per tree. The trees appear in
the order of the Tree factor, which here coincides with increasing average circumference.
As we can see here, the graph shows that tree 4 has the highest average circumference.
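The values behind the lollipop chart can be printed to confirm the reading. This sketch repeats the aggregate() call from the chart code; top_tree is an illustrative name.

```r
# Mean circumference per tree, in the order of the Tree factor levels.
avg_circum <- aggregate(circumference ~ Tree, data = as.data.frame(Orange),
                        FUN = mean)
print(avg_circum)

# Label of the tree with the largest mean circumference.
top_tree <- as.character(avg_circum$Tree[which.max(avg_circum$circumference)])
print(top_tree)
```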
• Box plot
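# Orange2 is not defined above; it is assumed here to be Orange with age as a factor
Orange2 <- transform(as.data.frame(Orange), age = factor(age))
boxp <- ggplot(Orange2, aes(age, circumference))
boxp + geom_boxplot(varwidth=T, fill="plum") +
  labs(title="Box plot",
       subtitle="Circumference grouped by Age of the Tree",
       x="Age group of the Tree",
       y="Circumference of the Tree")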

RESULT:

INFERENCE: The box plot describes the distribution of circumference for each age group.
Each box spans the interquartile range (IQR), and the line inside it represents the median.
The taller the box, the more spread out the circumference values are at that age.
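The numbers that each box is drawn from can be tabulated per age group. The sketch below uses quantile() with its default settings, which may differ slightly from the hinges a box plot uses; summary_by_age and box_height are illustrative names.

```r
# Minimum, quartiles and maximum of circumference for each age (one row per age).
summary_by_age <- t(sapply(split(Orange$circumference, Orange$age), quantile))
print(summary_by_age)

# Height of each box: distance between the 75% and 25% quantiles.
box_height <- summary_by_age[, "75%"] - summary_by_age[, "25%"]
print(box_height)
```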
• Violin plot
vp <- ggplot(Orange, aes(Tree, circumference, color=Tree))
vp + geom_violin(trim=FALSE) +
  geom_boxplot(width=0.1) +
  labs(title="Violin plot",
       subtitle="Circumference vs Tree",
       x="Type of Tree",
       y="Circumference")
RESULT:

INFERENCE: Describes the density of circumference values within each tree. The width
of a violin at a given height shows the density: where it is wide, many values lie close to
that circumference; where it is thin, few do. A short violin means the values are close to
each other, and a long one means they are spread out. In this case, the data is
categorised by tree. For tree 3, the values of circumference are closer to each other than
for tree 4, in which they are more distant; that is, tree 4 has a higher standard deviation.
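The comparison of spread between trees can be backed by the standard deviation of each tree. A minimal sketch in base R, with the illustrative name sd_by_tree:

```r
# Standard deviation of circumference within each tree,
# named by tree label in the order of the Tree factor levels.
sd_by_tree <- tapply(Orange$circumference, Orange$Tree, sd)
print(round(sd_by_tree, 2))
```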
GEOM FUNCTIONS FAILED TO WORK:
On the selected data set, i.e. Orange, one of the functions did not give the expected
result.
• Density plot: the kernel density estimation plot did not show the expected kernel-like
(smooth, bell-shaped) curve. geom_density needs only one numeric variable, so the
cause is more likely the data than the function: the data set is small (35
observations), and circumference is measured at only seven ages, so its values are
spread over a wide range rather than clustered around a single peak.
ggplot(data = Orange) +
geom_density(mapping = aes(circumference), colour="blue")
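As a cross-check, a one-dimensional kernel density estimate can be computed with base R's density() function, used here as a stand-in for geom_density. This is a sketch under default settings (Gaussian kernel, automatic bandwidth); d is an illustrative name.

```r
# Kernel density estimate of circumference over all 35 records.
d <- density(Orange$circumference)

print(d$n)    # number of observations used
print(d$bw)   # bandwidth chosen automatically
print(range(d$x))  # range of circumference values the curve is evaluated on
```

Calling plot(d) followed by rug(Orange$circumference) draws the curve with the individual observations marked along the axis, which helps to judge how much of the shape comes from the small sample.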

REFERENCES
• https://ggplot2.tidyverse.org/reference/
• https://smac-group.github.io/ds/index.html
