STA1040 MidSem Exam
2023-10-19
library(readxl)
install.packages('tidyverse', repos='http://cran.us.r-project.org')
install.packages('finalfit', repos='http://cran.us.r-project.org')
install.packages('dplyr', repos='http://cran.us.r-project.org')
## Warning: restored ’dplyr’
##
## The downloaded binary packages are in
## C:\Users\ADMIN\AppData\Local\Temp\RtmpayQgvO\downloaded_packages
library(dplyr)
##
## Attaching package: ’dplyr’
library(tidyverse)
library(finalfit)
Telecommunication_Data = read_excel("D:/Documents/School Documents/USIU/Y2/Y2S3/STA1040/Telecommunicatio
View(Telecommunication_Data)
data = data.frame(Telecommunication_Data)
Perform any 5 data manipulation or data cleaning techniques. State and describe
the technique being applied. Illustrate R codes and R outputs plus interpretation
[Figure: Missing values map. y-axis: variables region to churn; x-axis: Observation, 0 to 1000]
[Figure: Missing values map. y-axis: variables region to churn; x-axis: Observation, 0 to 1000]
"The above code locates any and all NA values within the dataframe and replaces them with
the integer 0. This makes subsequent data analysis easier as there are no longer any conflicts
of data types or NA values interrupting calculations"
## [1] "The above code locates any and all NA values within the dataframe and replaces them with \nthe i
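The code for this step did not survive in the text above, so the following is only a minimal sketch of the technique described (replacing every NA with 0). The data frame `toy` and its values are made up for illustration and are not the telecommunication data.

```r
# Toy data frame standing in for the telecom data (values are invented).
toy <- data.frame(
  tenure  = c(13, NA, 68),
  income  = c(64, 136, NA),
  logwire = c(NA, 3.58, NA)
)

# is.na(toy) is a logical matrix; assigning through it overwrites
# only the cells that are NA and leaves the rest unchanged.
toy_zero <- toy
toy_zero[is.na(toy_zero)] <- 0

# Count the NA values before and after the replacement.
na_before <- sum(is.na(toy))
na_after  <- sum(is.na(toy_zero))
print(toy_zero)
```

Zero-filling shifts summaries such as the mean, so it is worth checking whether 0 is a meaningful value for each variable before relying on it.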
[Figure: Missing values map. y-axis: variables region to churn; x-axis: Observation, 0 to 125]
summary(data2)
## 1st Qu.: 624.75 1st Qu.: 222.5 1st Qu.: 507.65 1st Qu.:1.691
## Median :1481.35 Median : 540.0 Median :1217.35 Median :2.116
## Mean :1590.61 Mean : 750.5 Mean :1595.79 Mean :2.183
## 3rd Qu.:2448.28 3rd Qu.:1022.5 3rd Qu.:2405.97 3rd Qu.:2.657
## Max. :4167.70 Max. :4975.0 Max. :6444.95 Max. :4.072
## logtoll logequi logcard logwire
## Min. :2.546 Min. :3.357 Min. :1.322 Min. :2.874
## 1st Qu.:3.002 1st Qu.:3.669 1st Qu.:2.536 1st Qu.:3.449
## Median :3.199 Median :3.764 Median :2.918 Median :3.696
## Mean :3.239 Mean :3.793 Mean :2.897 Mean :3.705
## 3rd Qu.:3.493 3rd Qu.:3.935 3rd Qu.:3.178 3rd Qu.:3.930
## Max. :4.208 Max. :4.353 Max. :4.241 Max. :4.698
## custcat churn
## Length:119 Length:119
## Class :character Class :character
## Mode :character Mode :character
##
##
##
View(data2)
"The above code locates all rows with NA values within the data frame and removes them from the
dataframe leaving you with fewer observations than before however all of them have all the variable data
## [1] "The above code locates all rows with NA values within the data frame and removes them from the\n
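The statement that created `data2` is not shown above. A minimal sketch of row-wise deletion, assuming `na.omit()` or an equivalent was used, looks like this on a made-up data frame `toy`:

```r
# Toy data frame with missing values in two different rows (invented values).
toy <- data.frame(
  age    = c(44, 33, NA, 52),
  income = c(64, NA, 116, 30)
)

# na.omit() keeps only the rows in which every variable is observed.
complete <- na.omit(toy)

rows_before <- nrow(toy)
rows_after  <- nrow(complete)
print(complete)
```

This matches the `summary(data2)` output above, where the character columns report `Length:119`, meaning only 119 complete rows remain.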
#3. Combining the 'region' and 'address' variables to create one comprehensive locator variable 'Address
data3 = data %>% replace(is.na(.), 'x')
View(data3)
data3 = data.frame(unite(data3, col = "Address", c('region', 'address'), sep = ", "))
#View(Address)
"The above code firstly creates a new dataframe where all the NA values have been replaced with the char
## [1] "The above code firstly creates a new dataframe where all the NA values have been replaced with t
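To show what the `unite()` step does to the columns, here is a base R sketch of the same idea on a made-up data frame `toy`. It swaps `paste()` in for `tidyr::unite()` so that it runs without extra packages; unlike `unite()`, it places the new column last rather than where `region` was.

```r
# Toy data frame; 'x' marks values that were NA before replacement (invented values).
toy <- data.frame(
  region  = c("2", "3", "x"),
  address = c("9", "x", "7"),
  income  = c(64, 136, 116),
  stringsAsFactors = FALSE
)

# Join the two columns into one string column, separated by ", ".
toy$Address <- paste(toy$region, toy$address, sep = ", ")

# Drop the two source columns, as unite() does by default.
toy <- toy[, !(names(toy) %in% c("region", "address")), drop = FALSE]
print(toy)
```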
## [1] "The above code creates a subset of the initial dataframe by scanning the ’marital’ column of the
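The code for technique 4 is missing from the text above. The sketch below shows one way to subset on the 'marital' column; the data frame `toy` and the label "Married" are assumptions, since the actual coding of the variable is not visible here.

```r
# Toy data frame (invented values); the "Married" label is an assumed coding.
toy <- data.frame(
  marital = c("Married", "Unmarried", "Married", NA),
  income  = c(64, 136, 116, 33),
  stringsAsFactors = FALSE
)

# subset() keeps rows where the condition is TRUE and drops rows where it is FALSE or NA.
married <- subset(toy, marital == "Married")
print(married)
```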
#5. Removing a column from the Data due to high frequency of missing values / poor data quality
data %>% missing_plot()
[Figure: Missing values map. y-axis: variables region to churn; x-axis: Observation, 0 to 1000]
na_sum = colSums(is.na(data))
print(na_sum)
"The 'logwire' column has the highest frequency of NA values hence will be the column to be removed"
## [1] "The ’logwire’ column has the highest frequency of NA values hence will be the column to be remove
colnames(data6)
"The above code first creates a visualisation of the location of NA values throughout the dataset so a
## [1] "The above code first creates a visualisation of the location of NA values throughout the datas
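The statement that builds `data6` is not shown above. A minimal sketch of the step described (find the column with the most NA values, then drop it) on a made-up data frame `toy` could look like this:

```r
# Toy data frame (invented values) in which 'logwire' has the most NA values.
toy <- data.frame(
  tenure  = c(13, 11, 68, 33),
  logtoll = c(NA, 3.1, 3.2, NA),
  logwire = c(NA, 3.58, NA, NA)
)

# Number of NA values per column, as in the na_sum step above.
na_sum <- colSums(is.na(toy))

# Name of the column with the highest NA count.
worst <- names(which.max(na_sum))

# Keep every column except that one.
toy_trimmed <- toy[, names(toy) != worst, drop = FALSE]
print(colnames(toy_trimmed))
```

Selecting the column by name from `na_sum` avoids hard-coding 'logwire', so the same lines still work if the data change.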
Consider any of the newly created data in 1(a) above. Describe the data being
used
" Considering 'data5' from the above question; the data is a subset of the parent Telecommunication' dat
only contains entries from respondents who are married regardless of age or income level"
## [1] " Considering ’data5’ from the above question; the data is a subset of the parent Telecommunicati
Provide appropriate descriptive summaries of any two variables using the chosen
"Using data5"
#range(income)
#table(income)
histogram1 = hist(income, xlab = 'Income', ylab = '# of Respondents', main = 'Histogram of Income of Mar
[Figure: Histogram of Income of Married respondents. x-axis: Income; y-axis: # of Respondents, 0 to 50]
#plot(income)
boxplot1 = boxplot(income, main = 'Boxplot of Income of Married respondents', ylab = 'Income')
[Figure: Boxplot of Income of Married respondents. Axis: Income, 100 to 600]
boxplot1
## $stats
## [,1]
## [1,] 15
## [2,] 34
## [3,] 55
## [4,] 80
## [5,] 145
##
## $n
## [1] 61
##
## $conf
## [,1]
## [1,] 45.69428
## [2,] 64.30572
##
## $out
## [1] 163 359 301 294 228 163 591 262 256 162
##
## $group
## [1] 1 1 1 1 1 1 1 1 1 1
##
## $names
## [1] ""
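The `$stats`, `$conf` and `$out` values above can be checked by hand. The sketch below copies the numbers from that output and applies the rules R's boxplot uses by default (whiskers reach at most 1.5 times the hinge spread; notches are the median plus or minus 1.58 times the hinge spread over the square root of n). The variable names are illustrative only.

```r
# Values copied from the boxplot1 output above.
stats <- c(15, 34, 55, 80, 145)   # lower whisker, lower hinge, median, upper hinge, upper whisker
n     <- 61
out   <- c(163, 359, 301, 294, 228, 163, 591, 262, 256, 162)

# Spread between the hinges (the boxplot's version of the IQR).
iqr <- stats[4] - stats[2]

# Points beyond these fences are drawn as outliers.
upper_fence <- stats[4] + 1.5 * iqr
lower_fence <- stats[2] - 1.5 * iqr

# Notch limits around the median, which should reproduce $conf.
conf <- stats[3] + c(-1, 1) * 1.58 * iqr / sqrt(n)

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))
print(conf)
print(all(out > upper_fence))
```

The upper fence works out to 149, so the upper whisker stops at 145 (the largest income not above the fence) and all ten values in `$out` lie above it. The lower fence is negative, which is why no low outliers appear.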
histogram1
## $breaks
## [1] 0 100 200 300 400 500 600
##
## $counts
## [1] 49 5 4 2 0 1
##
## $density
## [1] 0.0080327869 0.0008196721 0.0006557377 0.0003278689 0.0000000000
## [6] 0.0001639344
##
## $mids
## [1] 50 150 250 350 450 550
##
## $xname
## [1] "income"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
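The `$density` and `$mids` entries above follow directly from `$breaks` and `$counts`. The sketch below recomputes them from the printed values; the variable names are illustrative only.

```r
# Values copied from the histogram1 output above.
breaks <- c(0, 100, 200, 300, 400, 500, 600)
counts <- c(49, 5, 4, 2, 0, 1)

# Total number of married respondents in the histogram.
n <- sum(counts)

# Density = count / (n * bin width), so the bar areas sum to 1.
density <- counts / (n * diff(breaks))

# Midpoint of each bin.
mids <- (head(breaks, -1) + tail(breaks, -1)) / 2

# Share of respondents in the first bin (income below 100).
share_first_bin <- counts[1] / n

print(density)
print(mids)
print(round(share_first_bin, 3))
```

About 80% of the married respondents (49 of 61) fall in the first bin, which is the right skew visible in the histogram.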
##
## FALSE TRUE
## 58 3
"From the above analysis it is clear that among the Married respondents, income is not normally distribu
## [1] "From the above analysis it is clear that among the Married respondents, income is not normally d
## Male
## 29
femalesum
## Female
## 32
## Female
## 52.45902
maleprop
## Male
## 47.54098
"From the above analysis it is clear to see that, among the married respondents there is very little gen
## [1] "From the above analysis it is clear to see that, among the married respondents there is very lit
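The statements that produced the counts and percentages above are not shown. A minimal sketch that reproduces the percentages from the printed counts (29 male, 32 female) is given below; `gender_counts` and `gender_pct` are illustrative names, not the objects used in the original code.

```r
# Counts copied from the output above.
gender_counts <- c(Male = 29, Female = 32)

# prop.table() divides each count by the total; multiply by 100 for percentages.
gender_pct <- prop.table(gender_counts) * 100

print(round(gender_pct, 5))
```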