Tutorial 2 - Histogram
Big-Data-Analytics-with-R-and-Hadoop
Data modelling is a machine learning technique that identifies hidden patterns in a historical
dataset; these patterns then help to predict future values for the same kind of data. The technique
focuses on users' past actions and learns their tastes. Many popular organizations have adopted
data modelling techniques to understand the behaviour of their customers based on their past
transactions. These techniques analyse the data and predict what customers are looking for.
Amazon, Google, Facebook, eBay, LinkedIn, Twitter, and many other organizations use data mining
to redefine their applications.
Objective: A histogram is a visual representation of the distribution of a dataset. Its shape
is its most evident and informative characteristic: it lets you easily see where a
relatively large amount of the data is situated and where there is very little data. In other
words, you can see where the middle of your data distribution is, how closely the data lie around this
middle, and where possible outliers are to be found. Because of all this, histograms are a great way to
get to know your data!
Method:
1. Before you start coding, install the required packages: go to Packages > Install as
shown in Figure 1.
2. In the Packages field, type the following package names and install them as shown in
Figure 2.
i. ggplot2
ii. plyr
iii. shiny
iv. Rpubs
v. devtools.
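The same installation can also be done from the console. The following is a minimal sketch, assuming the packages are available from CRAN under the lowercase names shown; Rpubs is left out of the sketch because it may not be installable under that name.

```r
# Packages from the list above, written as CRAN spells them
# (package names are case-sensitive, so "shiny" is lowercase)
pkgs <- c("ggplot2", "plyr", "shiny", "devtools")

# Work out which of them are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install only in an interactive session, so that sourcing this
# script has no side effects
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```

Checking for missing packages first avoids reinstalling packages that are already present.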
Big Data
3. Go to File > Import Dataset > From Excel as shown in Figure 3. The “LungCapData”
dataset is imported and displayed on the screen as shown in Figure 4.
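The menu import can also be written as code. This is only a sketch: the file name "LungCapData.xls" is a hypothetical path that must be adjusted to wherever the Excel file is stored, and the readxl package is assumed to be installed.

```r
# Hypothetical path (assumption): change this to the location of your file
lung_cap_path <- "LungCapData.xls"

if (file.exists(lung_cap_path) && requireNamespace("readxl", quietly = TRUE)) {
  # read_excel() returns the first sheet as a tibble
  LungCapData <- readxl::read_excel(lung_cap_path)
  print(head(LungCapData))
} else {
  message("File or readxl package not found; nothing was imported.")
}
```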
> library(readxl)
> head(LungCapData)
# A tibble: 6 x 6
LungCap Age Height Smoke Gender Caesarean
<dbl> <dbl> <dbl> <chr> <chr> <chr>
1 6.48 6 62.1 no male no
2 10.1 18 74.7 yes female no
3 9.55 16 69.7 no female yes
4 11.1 14 71 no male no
5 4.8 5 56.9 no male no
6 6.22 11 58.7 no female no
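# Attach the dataset so that its columns (e.g. LungCap) can be referred to by name
> attach(LungCapData)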
# Type help, with the name of the command you would like help for in the brackets
> help(hist)
# Or simply put a question mark (?) in front of the command.
> ?hist
# To plot densities instead of frequencies, we may set the “prob” argument equal to TRUE,
# or to its shorthand, a capital T.
> hist(LungCap, prob=TRUE)
> hist(LungCap, prob=T)
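To see what the density scale means, the short sketch below checks that the bar areas of such a histogram add up to 1. It uses synthetic numbers as a stand-in for LungCap (an assumption, since the real dataset is not loaded here).

```r
# Synthetic stand-in for LungCap (assumption: not the real data)
set.seed(1)
lung_cap_demo <- rnorm(500, mean = 8, sd = 2.5)

# plot = FALSE returns the histogram object without drawing it
h <- hist(lung_cap_demo, plot = FALSE)

# With prob = TRUE the bar heights are h$density;
# the area of each bar is its height times its width
bar_areas <- h$density * diff(h$breaks)
total_area <- sum(bar_areas)
print(total_area)
```

Because the total area is 1, the y-axis no longer shows counts, which is why a limit such as 0.2 is a sensible choice for “ylim” on this scale.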
# We may change the x and y limits using the “xlim” and “ylim” arguments. Here, we set the y
# limits to run from 0 up to 0.2.
# Next, we change the bins. To do so we may use the “breaks” argument within the
# hist command.
> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=7)
# A single number for “breaks” is only a suggestion: R chooses “pretty” break points,
# which is why 8 bins are produced here.
# We can also specify the exact break points using the seq command
> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=seq(from=0, to=16, by=2))
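The difference between a suggested number of breaks and exact break points can be inspected without drawing anything. The sketch below uses synthetic values as a stand-in for LungCap (an assumption), clamped so that they fall inside the range 0 to 16.

```r
# Synthetic stand-in for LungCap (assumption), kept inside the range 0 to 16
set.seed(2)
lung_cap_demo <- pmin(pmax(rnorm(500, mean = 8, sd = 2.5), 0.5), 15.5)

# A single number is a suggestion; R rounds to "pretty" break points
h_suggested <- hist(lung_cap_demo, breaks = 7, plot = FALSE)
print(h_suggested$breaks)

# A vector of break points is used exactly as given
h_exact <- hist(lung_cap_demo, breaks = seq(from = 0, to = 16, by = 2),
                plot = FALSE)
print(h_exact$breaks)

# The number of bins is always one less than the number of break points
n_bins_exact <- length(h_exact$counts)
print(n_bins_exact)
```

Note that an explicit vector of breaks must span the whole range of the data, otherwise hist stops with an error.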
# Now we change the title using the “main” argument and label the x-axis and
# y-axis using “xlab” and “ylab”.
> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=seq(from=0, to=16, by=2),
+ main="Histogram of Lung Capacity", xlab="Lung Capacity")
# Next, we rotate the values on the y-axis by setting the “las” argument equal to 1.
> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=seq(from=0, to=16, by=2),
+ main="Histogram of Lung Capacity", xlab="Lung Capacity", las=1)
# Finally, we add a density curve over this plot. This can be done using the “lines”
# command.
> lines(density(LungCap))
# We can also change the colour of the line using col="red" and its width using “lwd”.
> lines(density(LungCap), col="red", lwd=3)
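The density curve lines up with the bars only because the histogram is drawn with prob=T: both are then on a scale where the total area is about 1. The sketch below illustrates this on synthetic values standing in for LungCap (an assumption), and the title text is likewise only illustrative.

```r
# Synthetic stand-in for LungCap (assumption), kept inside the breaks 0 to 16
set.seed(42)
lung_cap_demo <- pmin(pmax(rnorm(500, mean = 8, sd = 2.5), 0.5), 15.5)

# Kernel density estimate: d$x are the evaluation points (equally spaced),
# d$y the estimated density at each point
d <- density(lung_cap_demo)

# Approximate the area under the curve with a simple Riemann sum
curve_area <- sum(d$y) * diff(d$x)[1]
print(round(curve_area, 2))

# Histogram on the density scale with the curve drawn on top
hist(lung_cap_demo, prob = TRUE, ylim = c(0, 0.2),
     breaks = seq(from = 0, to = 16, by = 2),
     main = "Histogram of Lung Capacity (synthetic data)",
     xlab = "Lung Capacity", las = 1)
lines(d, col = "red", lwd = 3)
```

If the histogram were drawn with counts instead, the curve would appear as a nearly flat line at the bottom of the plot.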