
Tutorial 2 - Histogram

1. The document discusses making histograms in R by analyzing lung capacity data. It shows how to import data, generate a histogram with default settings, and customize histograms by changing axes, bins, titles, labels, and adding density curves.
2. Key steps include importing data, generating a histogram with the hist() function, and customizing aspects like changing from frequencies to probability density, setting axis limits, varying the number of bins, and adding titles/labels.
3. Advanced customization allows rotating y-axis labels, and overlaying density curves estimated from the data using the lines() and density() functions.

Uploaded by

Anwar Zainuddin

Big Data

Big-Data-Analytics-with-R-and-Hadoop

Trainer: Ts. Dr. Ahmad Anwar Zainuddin

Performing data modelling in R

Data modelling is a machine learning technique that identifies hidden patterns in a historical
dataset; these patterns then help to predict future values for the same kind of data. The technique
focuses on users' past actions and learns their tastes. Many popular organizations have adopted
data modelling techniques to understand the behaviour of their customers based on their past
transactions. These techniques analyse the data and predict what customers are looking for.
Amazon, Google, Facebook, eBay, LinkedIn, Twitter, and many other organizations use data mining
to redefine their applications.

Tutorial 2: Making a Histogram in R: www.youtube.com/watch?v=Hj1pgap4UOY

Objective: A histogram is a visual representation of the distribution of a dataset. Its shape
is its most evident and informative characteristic: it lets you see at a glance where a
relatively large amount of the data is situated and where there is very little data. In other
words, you can see where the middle of your data distribution is, how closely the data lie around
this middle, and where possible outliers are to be found. Because of all this, histograms are a
great way to get to know your data!

Method:
1. Before you start coding, install the packages listed below: go to Package > Install, as
shown in Figure 1.

Figure 1 : Install Packages

2. Go to the Packages section, type the names of the following packages, and install them as
shown in Figure 2. Package names are case-sensitive.
i. ggplot2
ii. plyr
iii. shiny
iv. Rpubs
v. devtools
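
The same installation can also be done from the console. The following is a minimal sketch, not part of the original handout: `missing_packages` is a hypothetical helper name, and "Rpubs" is left out on the assumption that it is not distributed on CRAN under that name.

```r
# Package names taken from the list above. "Rpubs" is omitted
# (assumption: it is not available on CRAN under that name).
wanted <- c("ggplot2", "plyr", "shiny", "devtools")

# Hypothetical helper: return the packages that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(wanted)

# Install only from an interactive session, so that a script never
# downloads anything unexpectedly.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Checking first avoids reinstalling packages that are already present.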

Figure 2: ggplot2 Installing Package

3. Import the dataset from this link:


http://www.mediafire.com/file/nayf5x3fz208wm8/BigData-with-R-LungCapData.zip/file

Go to File > Import Dataset > From Excel, as shown in Figure 3. The “LungCapData” dataset
is then imported and displayed on the screen, as shown in Figure 4.

Figure 3: Import Dataset



Figure 4: The LungCapData is imported

4. Type the following code:

> LungCapData <- read.table(file.choose(), header=T, sep="\t")

Warning message:
In read.table(file.choose(), header = T, sep = "\t") :
  incomplete final line found by readTableHeader on 'D:\Big Data Practices\SLIDES\Big Data with SQL\R\Dataset\LungCapData\LungCapData.xls'

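This warning appears because read.table() expects a tab-delimited text file, not an Excel
workbook. If it appears, import the dataset again as shown in step 3, or read the workbook
with read_excel() from the readxl package: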

> library(readxl)

> LungCapData <- read_excel("D:/Big Data Practices/SLIDES/Big Data with SQL/R/Dataset/LungCapData/LungCapData.xls")
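
For contrast, read.table() with sep="\t" does work when the file really is tab-delimited text. The sketch below illustrates this with a small temporary file. The rows are made-up stand-in values, not the real LungCapData, and the names `demo`, `txt_file` and `demo_back` are illustrative only.

```r
# Made-up stand-in rows (NOT the real LungCapData values).
demo <- data.frame(LungCap = c(5.1, 7.3, 9.2),
                   Age     = c(7, 12, 17),
                   Smoke   = c("no", "yes", "no"))

# Write a genuine tab-delimited text file to a temporary location.
txt_file <- tempfile(fileext = ".txt")
write.table(demo, txt_file, sep = "\t", row.names = FALSE, quote = FALSE)

# read.table() can parse this file because it is plain text with tabs.
demo_back <- read.table(txt_file, header = TRUE, sep = "\t")
str(demo_back)
```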

# To view and attach the dataset, and to list its variable names


> View(LungCapData)
> attach(LungCapData)
> names(LungCapData)
[1] "LungCap" "Age" "Height" "Smoke" "Gender" "Caesarean"

> head(LungCapData)

# A tibble: 6 x 6
  LungCap   Age Height Smoke Gender Caesarean
    <dbl> <dbl>  <dbl> <chr> <chr>  <chr>
1    6.48     6   62.1 no    male   no
2   10.1     18   74.7 yes   female no
3    9.55    16   69.7 no    female yes
4   11.1     14   71   no    male   no
5    4.8      5   56.9 no    male   no
6    6.22    11   58.7 no    female no
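
attach() places the columns of LungCapData on R's search path, which is why the commands in this tutorial can refer to LungCap by name alone. As a hedged alternative, the sketch below shows two common ways to reach a column without attaching the data frame. The data frame here is a made-up stand-in, not the real dataset.

```r
# Made-up stand-in for LungCapData (NOT the real values).
LungCapData <- data.frame(LungCap = c(5.1, 7.3, 9.2, 6.8),
                          Age     = c(7, 12, 17, 10))

# Option 1: refer to the column explicitly with $.
mean_dollar <- mean(LungCapData$LungCap)

# Option 2: evaluate an expression inside the data frame with with().
mean_with <- with(LungCapData, mean(LungCap))

print(c(mean_dollar, mean_with))
```

Both options give the same result and avoid name clashes when several datasets share column names.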
# For help on a command, put its name inside the brackets of help()
> help(hist)
# Or simply put a question mark (?) in front of the command.
> ?hist

# To produce a histogram of Lung Capacity

> hist(LungCap)
# Notice that, by default, R reports "frequencies" on the y-axis and chooses the title and the bin width itself.
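
To see what R chose by default, the histogram can be computed without drawing it. This is a minimal sketch on simulated values that merely stand in for LungCap (an assumption, since the real data file is not bundled here).

```r
# Simulated stand-in for LungCap (NOT the real data), kept inside (0, 16).
set.seed(1)
LungCap <- pmin(pmax(rnorm(500, mean = 8, sd = 2.5), 0.5), 15.5)

# plot = FALSE computes the bins but draws nothing.
h <- hist(LungCap, plot = FALSE)

h$breaks   # bin boundaries chosen by R
h$counts   # the "frequencies": number of observations in each bin

# Every observation falls in exactly one bin.
sum(h$counts) == length(LungCap)
```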

# Now we change some of the default values:

# 1. Change the y-axis to show a "probability density" rather than "frequencies". To do so, set the "freq" argument to FALSE.
> hist(LungCap, freq=FALSE)
# or abbreviate FALSE with a capital F.
> hist(LungCap, freq=F)

# Alternatively, set the "prob" argument to TRUE, which may likewise be abbreviated with a capital T.
> hist(LungCap, prob=TRUE)
> hist(LungCap, prob=T)
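
On the density scale, the bar heights are scaled so that the total area of the bars equals 1. The sketch below checks this on simulated stand-in values (an assumption, not the real LungCap data), using the density component that hist() returns.

```r
# Simulated stand-in for LungCap (NOT the real data).
set.seed(1)
LungCap <- pmin(pmax(rnorm(500, mean = 8, sd = 2.5), 0.5), 15.5)

h <- hist(LungCap, plot = FALSE)

# Bar height on the density scale = count / (n * bin width).
bin_widths <- diff(h$breaks)
by_hand    <- h$counts / (length(LungCap) * bin_widths)

# The bar areas (height * width) add up to 1.
total_area <- sum(h$density * bin_widths)
total_area
```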

# We can change the x and y limits using the "xlim" and "ylim" arguments. Here, we set the y limits to run from 0 to 0.2.

> hist(LungCap, prob=T, ylim=c(0, 0.2))
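
If the upper y limit is set too low, the tallest bars are cut off. One possible way to pick a limit from the data is sketched below. The values are simulated stand-ins, and rounding up to the next 0.05 is an arbitrary illustrative choice.

```r
# Simulated stand-in for LungCap (NOT the real data).
set.seed(1)
LungCap <- pmin(pmax(rnorm(500, mean = 8, sd = 2.5), 0.5), 15.5)

# Find the tallest bar on the density scale without drawing anything.
h <- hist(LungCap, plot = FALSE)

# Round the tallest bar up to the next multiple of 0.05.
y_top <- ceiling(max(h$density) * 20) / 20

# Draw the histogram with a y-axis that is guaranteed to fit every bar.
hist(LungCap, freq = FALSE, ylim = c(0, y_top))
```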



# Now we change the bin width. To do so, use the "breaks" argument of the hist() command.
> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=7)
# A single number is only a suggestion for the number of bins: R rounds the break points to "pretty" values, so the histogram may not have exactly 7 bins.

> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=14)
# breaks=14 suggests about 14 bins.

> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=24)
# breaks=24 suggests about 24 bins.
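
Because a single number passed to "breaks" is only a suggestion, it can help to check how many bins R actually produced. A minimal sketch on simulated stand-in values (an assumption, not the real LungCap data):

```r
# Simulated stand-in for LungCap (NOT the real data).
set.seed(1)
LungCap <- pmin(pmax(rnorm(500, mean = 8, sd = 2.5), 0.5), 15.5)

requested <- c(7, 14, 24)

# For each requested value, count the bins R really used.
actual <- sapply(requested, function(n) {
  h <- hist(LungCap, breaks = n, plot = FALSE)
  length(h$counts)   # number of bins = number of break points - 1
})

data.frame(requested = requested, actual = actual)
```

The number of bins is always one fewer than the number of break points.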

# We can also specify every break point individually.

> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=c(0,2,4,6,8,10,12,14,16))

# Or generate the same break points with the seq() command.
> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=seq(from=0, to=16, by=2))
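
When a vector of break points is supplied, R uses exactly those points, provided they span the whole range of the data. The sketch below confirms that the two ways of writing the vector agree and that 9 break points give 8 bins. The values are simulated stand-ins kept inside (0, 16), not the real LungCap data.

```r
# Simulated stand-in for LungCap (NOT the real data), kept inside (0, 16).
set.seed(1)
LungCap <- pmin(pmax(rnorm(500, mean = 8, sd = 2.5), 0.5), 15.5)

manual <- c(0, 2, 4, 6, 8, 10, 12, 14, 16)
by_seq <- seq(from = 0, to = 16, by = 2)

# Both vectors contain the same break points.
all(manual == by_seq)

# Explicit break points are used as given: 9 points, hence 8 bins.
h <- hist(LungCap, breaks = by_seq, plot = FALSE)
length(h$counts)
```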

# Now we change the title using the "main" argument and label the x-axis and y-axis using "xlab" and "ylab".
> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=seq(from=0, to=16, by=2), main="Histogram of Lung Capacity", xlab="Lung Capacity")

# Next, we rotate the values on the y-axis by setting the "las" argument to 1.
> hist(LungCap, prob=T, ylim=c(0, 0.2), breaks=seq(from=0, to=16, by=2), main="Histogram of Lung Capacity", xlab="Lung Capacity", las=1)
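
The same call is easier to read when each argument sits on its own line. The sketch below also sets "ylab", which the text mentions but the commands above do not use. The label text "Probability density" is an illustrative choice, and the data are simulated stand-ins, not the real LungCap values.

```r
# Simulated stand-in for LungCap (NOT the real data), kept inside (0, 16).
set.seed(1)
LungCap <- pmin(pmax(rnorm(500, mean = 8, sd = 2.5), 0.5), 15.5)

h <- hist(LungCap,
          freq   = FALSE,                         # density scale
          ylim   = c(0, 0.2),                     # y-axis limits
          breaks = seq(from = 0, to = 16, by = 2),
          main   = "Histogram of Lung Capacity",  # plot title
          xlab   = "Lung Capacity",               # x-axis label
          ylab   = "Probability density",         # y-axis label (illustrative)
          las    = 1)                             # horizontal axis values
```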

# Finally, we add a "density curve" over this plot, using the "lines" command.
> lines(density(LungCap))

# We can also change the colour of the line using col="red" and its width using lwd.
> lines(density(LungCap), col="red", lwd=3)
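
Putting the steps together: the density curve only lines up with the bars when the histogram is drawn on the density scale, because density() returns a curve whose area is about 1. A minimal end-to-end sketch on simulated stand-in values (an assumption, not the real LungCap data):

```r
# Simulated stand-in for LungCap (NOT the real data), kept inside (0, 16).
set.seed(1)
LungCap <- pmin(pmax(rnorm(500, mean = 8, sd = 2.5), 0.5), 15.5)

# Kernel density estimate of the data.
d <- density(LungCap)

# Draw the histogram on the density scale first ...
hist(LungCap, freq = FALSE, ylim = c(0, 0.2),
     breaks = seq(from = 0, to = 16, by = 2),
     main = "Histogram of Lung Capacity", xlab = "Lung Capacity", las = 1)

# ... then add the curve on top of the existing plot.
lines(d, col = "red", lwd = 3)
```

With freq=TRUE the bars would be counts, and the curve would lie almost flat along the x-axis.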
