TEB2043 Introduction To Data Science: Descriptive Analytics & Visualization DR Shuhaida Mohamed Shuhidan JAN 2025
I. Descriptive Analytics
Descriptive Analytics
• The examination of data or content, usually manually performed, to answer the question “What
happened?” (or What is happening?), characterized by traditional business intelligence (BI) and
visualizations such as pie charts, bar charts, line graphs, tables, or generated narratives. (gartner.com)
• The interpretation of historical data to better understand changes that have occurred in a business.
Descriptive analytics describes the use of a range of historic data to draw comparisons. Most commonly
reported financial metrics are a product of descriptive analytics, for example, year-over-year pricing
changes, month-over-month sales growth, the number of users, or the total revenue per subscriber. These
measures all describe what has occurred in a business during a set period. (investopedia.com)
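As a rough illustration of the metrics mentioned above, month-over-month growth can be computed from a series of monthly totals. This is only a sketch: the `sales` figures and variable names are hypothetical values made up for the example, not data from the course.

```r
# Hypothetical monthly sales totals (illustrative values only)
sales <- c(Jan = 200, Feb = 220, Mar = 209, Apr = 250.8)

# Month-over-month growth: (current - previous) / previous, as a percentage
mom_growth <- diff(sales) / head(sales, -1) * 100

round(mom_growth, 1)
```

With these made-up numbers, the growth is 10% for February, -5% for March, and 20% for April.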
Descriptive Statistics
• Data are described using summary measures, e.g., measures of central tendency and measures of
variability or dispersion.
• Central tendency - represents the whole set of data by a single value. It gives us the location of central
points.
o Mean
o Mode
o Median
• Variability/dispersion - the spread of the data, i.e., how widely the values are distributed.
o Range
o Variance
o Standard deviation
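The measures listed above can be tried on a small sample before moving to a real dataset. The sketch below uses a made-up vector `x` (hypothetical values, chosen so the arithmetic is easy to check by hand). The mode is left out because base R's mode function reports an object's storage type, not its most frequent value.

```r
# Small made-up sample (hypothetical values, chosen for easy arithmetic)
x <- c(2, 4, 4, 5, 10)

# Central tendency
mean(x)    # arithmetic average
median(x)  # middle value of the sorted data

# Variability / dispersion
max(x) - min(x)  # range as a single number
var(x)           # sample variance (divides by n - 1)
sd(x)            # standard deviation, the square root of the variance
```

For this sample, the mean is 5, the median is 4, the range is 8, the variance is 9, and the standard deviation is 3.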
Descriptive Statistics
• Or use the tail function to view the last rows of the data:
tail(data)
Descriptive Statistics AL2 Think Pair Share
(Quiz #2 - Part of 2%)
Quiz #2
Define Factor and discuss the meaning of “Factor w/ 3 levels” in the above figure.
Descriptive Statistics
• The minimum and maximum values from the data can be determined using min and max functions
respectively.
min(data$Sepal.Length) #this produces 4.3
max(data$Sepal.Length) #this produces 7.9
• Alternatively, minimum and maximum values can be determined by range function.
range(data$Sepal.Length)
• The above code produces 4.3 7.9; either value can be extracted by its index.
range(data$Sepal.Length)[1] #this produces 4.3
range(data$Sepal.Length)[2] #this produces 7.9
• Similarly, the code above can be written as:
range_val <- range(data$Sepal.Length)
range_val[1]
range_val[2]
• To obtain the range as a single value (maximum minus minimum), combine the min and max functions:
the_range <- max(data$Sepal.Length)-min(data$Sepal.Length)
the_range #this produces 3.6
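A slightly shorter way to get the same single-number range is to take the difference of the two values returned by range. This sketch assumes, as the examples above imply but the slides do not show explicitly, that `data` holds the built-in iris dataset.

```r
# Assumption: `data` is the built-in iris dataset, as the examples above imply
data <- iris

# range() returns c(min, max); diff() subtracts the first from the second
the_range <- diff(range(data$Sepal.Length))
the_range
```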
Descriptive Statistics AL3 In-Class Teams
(Quiz #2 - Part of 2%)
• The mean and median of data can be determined with the mean and median functions.
mean(data$Sepal.Length) #this produces 5.843333
median(data$Sepal.Length) #this produces 5.8
Quiz #2:
How about the mode? Investigate how the mode can be calculated in R.
Descriptive Statistics
• Standard deviation and variance can be determined with the sd and var functions respectively.
sd(data$Sepal.Length) #this produces 0.8280661
var(data$Sepal.Length) #this produces 0.6856935
• Variance (σ²) is the average of the squared differences from the mean; the standard deviation (σ) is its square root.
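One detail worth checking by hand: R's var function computes the sample variance, which divides the sum of squared differences by n - 1 rather than n. The sketch below reproduces var and sd manually, assuming `data` is the built-in iris dataset; the names `manual_var` and `manual_sd` are introduced here for illustration.

```r
# Assumption: `data` is the built-in iris dataset
data <- iris
x <- data$Sepal.Length
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(x))  # TRUE
all.equal(manual_sd, sd(x))    # TRUE
```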
Normal distribution
Descriptive Statistics
• We have seen how the summary function is used to generate useful descriptive statistics.
summary(data)
summary(data$Sepal.Length)
Descriptive Statistics
sort(A)
n = 9
3rd quartile position = 0.75 * 9 = 6.75
Round up: 6.75 → 7
7th value = 181.5
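The manual steps above can be reproduced in R. The vector `A` below is a hypothetical set of nine values invented for this sketch (the slide's own data are not reproduced here), so the result differs from 181.5.

```r
# Hypothetical data: nine made-up values (not the values from the slide)
A <- c(12, 5, 9, 20, 7, 15, 3, 18, 11)

sorted_A <- sort(A)
n <- length(A)

# Position of the 3rd quartile: 0.75 * n, rounded up to the next whole number
pos <- ceiling(0.75 * n)

# The 3rd quartile is the value at that position in the sorted data
q3_manual <- sorted_A[pos]
q3_manual

# For comparison, quantile() with type = 1 follows the same rounding-up rule
unname(quantile(A, 0.75, type = 1))
```

Note that quantile offers several calculation methods through its type argument, and its default may not match the hand calculation for other sample sizes.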
Descriptive Statistics
• Other than iris, there are a number of other built-in datasets provided by R:
data()
• Load the mtcars data:
data <- mtcars
• View the mtcars data:
data
• To draw a scatter plot, define the x and y values and pass them to the plot function:
x<- -10:10
y<-x*x
plot(x,y,xlab='x',ylab='y',col='red')
Scatter Plot
Histogram
hist(data$mpg,col="green")
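To see what the histogram is actually drawing, hist can be asked to return its bins instead of plotting them. This sketch assumes `data` is the mtcars dataset loaded earlier; the name `h` is introduced here for illustration.

```r
# Assumption: `data` is the built-in mtcars dataset, as loaded earlier
data <- mtcars

# plot = FALSE returns the histogram object without drawing it
h <- hist(data$mpg, plot = FALSE)

h$breaks  # bin boundaries chosen by hist
h$counts  # number of cars whose mpg falls in each bin
```

Each bar in the plotted histogram corresponds to one entry of `h$counts`, and the counts add up to the number of rows in the data.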
Bar Chart
val=data$mpg
carnames=row.names(data)
barplot(val,ylab='mpg',main="Car - MPG",names.arg= carnames,
cex.names=0.6,las=2,col="blue")
More on Scatter Plot
my<-read.csv("C:/Users/nurulaida/OneDrive - Universiti Teknologi PETRONAS/R/IDS/covid_my.csv")
my
x<-1:15
y<-my$Confirmed
plot(x,y,pch=16,col='blue',ylab='Confirmed cases',
main="Covid-19 Confirmed Cases in Malaysia")
text(x,y,labels=my$State,pos=4,cex=0.5)
More on Bar Chart
val=my$Deaths
name_st=my$State
barplot(val,ylab='Deaths',main="Covid-19 Deaths in Malaysia",names.arg=
name_st, cex.names=0.6,las=2,col="orange")
Pie Chart
lbl=my$State
val2=my$Confirmed
pie(val2,lbl,cex=0.5)
3D Pie Chart
library(plotrix)
val2=my$Confirmed
lbl=my$State
pie3D(val2,
col = hcl.colors(length(val2), "Spectral"),
border = "white",
labels=lbl,labelcex=0.5)
Exploded 3D Pie Chart
library(plotrix)
val2=my$Confirmed
lbl=my$State
pie3D(val2,
col = hcl.colors(length(val2), "Spectral"),
border = "white",
labels=lbl,labelcex=0.5,explode=0.2)
Exploded 3D Pie Chart
val2=my[my$Confirmed>300000,]
val3=val2$Confirmed
lbl=paste(val2$State,val2$Confirmed,sep=",")
pie3D(val3,
col = hcl.colors(length(val3), "Spectral"),
border = "white",
labels=lbl,labelcex=0.5,explode=0.2)
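The filtering and labelling steps above do not depend on plotrix and can be checked on their own. The data frame `my_demo` below is a hypothetical stand-in for `my` with made-up state names and counts; only the column names State and Confirmed are taken from the code above.

```r
# Hypothetical stand-in for the `my` data frame (made-up values)
my_demo <- data.frame(
  State = c("A", "B", "C", "D"),
  Confirmed = c(450123, 120456, 310789, 95012)
)

# Logical row index: keep only rows where Confirmed exceeds 300000
big <- my_demo[my_demo$Confirmed > 300000, ]

# Build one label per remaining row, e.g. "A,450123"
lbl <- paste(big$State, big$Confirmed, sep = ",")
lbl
```

The trailing comma in `my_demo[condition, ]` matters: it selects rows and keeps all columns.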
Case Study 1 – Stacked Bar Chart AL5 Reflection
(Quiz #2 - Part of 2%)
val2=my[my$Confirmed>300000,]
tbl1=val2[c("Confirmed","Population")]
legendval=c("Confirmed","Population")
colors=c("green","orange")
row.names(tbl1)=val2$State
tbl2=t(tbl1)
options(scipen=999)
barplot(as.matrix(tbl2),col=colors,cex.names=0.8,las=2, cex.axis=0.8)
legend("topright", legendval, cex=0.8, fill=colors)
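barplot stacks the rows of a matrix within each column, which is why the table above is transposed with t before plotting. The sketch below shows the shapes involved using a hypothetical three-state table with made-up numbers (`tbl_demo` is a stand-in for tbl1); it only inspects the matrix and does not draw anything.

```r
# Hypothetical stand-in for tbl1 (made-up values, states as row names)
tbl_demo <- data.frame(
  Confirmed = c(400, 350, 500),
  Population = c(3000, 6500, 4200),
  row.names = c("State A", "State B", "State C")
)

# Before transposing: one row per state, one column per measure
dim(tbl_demo)

# After transposing: one column per state, so each bar is a state and
# the stacked segments are Confirmed and Population
m <- t(tbl_demo)
dim(m)
colnames(m)
rownames(m)
```

Since t already returns a matrix when given a data frame, passing `m` straight to barplot would give one stacked bar per state.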
Case Study 2 – Geomap AL5 Reflection
(Quiz #2 - Part of 2%)
Summary
You have learned…
✓ Descriptive analytics
✓ Descriptive statistics that are important for descriptive analytics (and data preparation)
✓ Visualization
Next…
❖ More on data cleaning
❖ Feature settings