STA1040 MidSem Exam

Mark Bilahi M’rabu

2023-10-19

Import the Data and necessary packages

library(readxl)
install.packages('tidyverse', repos='http://cran.us.r-project.org')

## Installing package into ’C:/Users/ADMIN/AppData/Local/R/win-library/4.3’


## (as ’lib’ is unspecified)

## package ’tidyverse’ successfully unpacked and MD5 sums checked


##
## The downloaded binary packages are in
## C:\Users\ADMIN\AppData\Local\Temp\RtmpayQgvO\downloaded_packages

install.packages('finalfit', repos='http://cran.us.r-project.org')

## Installing package into ’C:/Users/ADMIN/AppData/Local/R/win-library/4.3’


## (as ’lib’ is unspecified)

## package ’finalfit’ successfully unpacked and MD5 sums checked


##
## The downloaded binary packages are in
## C:\Users\ADMIN\AppData\Local\Temp\RtmpayQgvO\downloaded_packages

install.packages('dplyr', repos='http://cran.us.r-project.org')

## Installing package into ’C:/Users/ADMIN/AppData/Local/R/win-library/4.3’


## (as ’lib’ is unspecified)

## package ’dplyr’ successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package ’dplyr’

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying


## C:\Users\ADMIN\AppData\Local\R\win-library\4.3\00LOCK\dplyr\libs\x64\dplyr.dll
## to C:\Users\ADMIN\AppData\Local\R\win-library\4.3\dplyr\libs\x64\dplyr.dll:
## Permission denied

## Warning: restored ’dplyr’

##
## The downloaded binary packages are in
## C:\Users\ADMIN\AppData\Local\Temp\RtmpayQgvO\downloaded_packages

library(dplyr)

##
## Attaching package: ’dplyr’

## The following objects are masked from ’package:stats’:


##
## filter, lag

## The following objects are masked from ’package:base’:


##
## intersect, setdiff, setequal, union

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --


## v forcats 1.0.0 v readr 2.1.4
## v ggplot2 3.4.3 v stringr 1.5.0
## v lubridate 1.9.3 v tibble 3.2.1
## v purrr 1.0.2 v tidyr 1.3.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --


## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(finalfit)
Telecommunication_Data = read_excel("D:/Documents/School Documents/USIU/Y2/Y2S3/STA1040/Telecommunicatio
View(Telecommunication_Data)
data = data.frame(Telecommunication_Data)

Perform any 5 data manipulation or data cleaning techniques. State and describe the technique being applied. Illustrate R codes and R outputs plus interpretation

data %>% missing_plot()

[Figure: Missing values map. y-axis: the 22 variables, region to churn; x-axis: Observation, 0 to 1000]

#1. Replacing all NA values with zeros


data1 = data %>% replace(is.na(.), 0)
data1 %>% missing_plot()

[Figure: Missing values map for data1. y-axis: the 22 variables, region to churn; x-axis: Observation, 0 to 1000]

"The above code locates any and all NA values within the dataframe and replaces them with
the integer 0. This makes subsequent data analysis easier as there are no longer any conflicts
of data types or NA values interrupting calculations"

## [1] "The above code locates any and all NA values within the dataframe and replaces them with \nthe i
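As a hedged illustration of the step above, the sketch below uses a small made-up data frame (`toy`, not the exam data) to show two side effects of blanket zero-filling: character columns receive the string "0", and numeric means are pulled downward. Filling only the numeric columns is shown as one possible alternative; it is an assumption for illustration, not the method used in this report.

```r
# Toy data frame (hypothetical values, not the Telecommunication data)
toy <- data.frame(
  income = c(40, NA, 80),
  gender = c("Male", NA, "Female"),
  stringsAsFactors = FALSE
)

# Blanket replacement, equivalent to data %>% replace(is.na(.), 0)
filled <- replace(toy, is.na(toy), 0)

# Mean ignoring NA versus mean after zero-filling
mean_ignoring_na <- mean(toy$income, na.rm = TRUE)  # (40 + 80) / 2
mean_zero_filled <- mean(filled$income)             # (40 + 0 + 80) / 3

# Alternative: fill zeros in numeric columns only, leave text columns as NA
numeric_only <- toy
num_cols <- sapply(numeric_only, is.numeric)
numeric_only[num_cols] <- lapply(
  numeric_only[num_cols],
  function(x) replace(x, is.na(x), 0)
)
```

Comparing `mean_ignoring_na` with `mean_zero_filled` gives a quick sense of how much the imputed zeros shift a summary statistic.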

#2. Removing all NA values from the dataframe


data2 = na.omit(data)
data2 %>% missing_plot()

[Figure: Missing values map for data2. y-axis: the 22 variables, region to churn; x-axis: Observation, 0 to 125]

summary(data2)

## region tenure age marital


## Length:119 Min. : 2.00 Min. :20.00 Length:119
## Class :character 1st Qu.:16.00 1st Qu.:31.00 Class :character
## Mode :character Median :34.00 Median :39.00 Mode :character
## Mean :34.91 Mean :40.91
## 3rd Qu.:51.50 3rd Qu.:49.00
## Max. :72.00 Max. :69.00
## address income employ retire
## Min. : 0.00 Min. : 15.00 Min. : 0.00 Length:119
## 1st Qu.: 4.00 1st Qu.: 37.00 1st Qu.: 3.00 Class :character
## Median : 9.00 Median : 57.00 Median : 7.00 Mode :character
## Mean :11.29 Mean : 97.06 Mean :10.43
## 3rd Qu.:15.50 3rd Qu.: 96.50 3rd Qu.:15.50
## Max. :44.00 Max. :944.00 Max. :39.00
## gender reside tollfree tollten
## Length:119 Min. :1.00 Length:119 Min. : 23.05
## Class :character 1st Qu.:1.00 Class :character 1st Qu.: 318.80
## Mode :character Median :2.00 Mode :character Median : 851.70
## Mean :2.42 Mean :1051.35
## 3rd Qu.:3.00 3rd Qu.:1659.45
## Max. :6.00 Max. :4905.85
## equipten cardten wireten loglong
## Min. : 29.05 Min. : 5.0 Min. : 20.95 Min. :0.470
## 1st Qu.: 624.75 1st Qu.: 222.5 1st Qu.: 507.65 1st Qu.:1.691
## Median :1481.35 Median : 540.0 Median :1217.35 Median :2.116
## Mean :1590.61 Mean : 750.5 Mean :1595.79 Mean :2.183
## 3rd Qu.:2448.28 3rd Qu.:1022.5 3rd Qu.:2405.97 3rd Qu.:2.657
## Max. :4167.70 Max. :4975.0 Max. :6444.95 Max. :4.072
## logtoll logequi logcard logwire
## Min. :2.546 Min. :3.357 Min. :1.322 Min. :2.874
## 1st Qu.:3.002 1st Qu.:3.669 1st Qu.:2.536 1st Qu.:3.449
## Median :3.199 Median :3.764 Median :2.918 Median :3.696
## Mean :3.239 Mean :3.793 Mean :2.897 Mean :3.705
## 3rd Qu.:3.493 3rd Qu.:3.935 3rd Qu.:3.178 3rd Qu.:3.930
## Max. :4.208 Max. :4.353 Max. :4.241 Max. :4.698
## custcat churn
## Length:119 Length:119
## Class :character Class :character
## Mode :character Mode :character
##
##
##

View(data2)
"The above code locates all rows with NA values within the data frame and removes them from the
dataframe leaving you with fewer observations than before however all of them have all the variable data

## [1] "The above code locates all rows with NA values within the data frame and removes them from the\n
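The cost of this step can be measured before committing to it. In the output above, `summary(data2)` reports 119 rows, while the first missing values map has an observation axis running to 1000, so most rows appear to have been dropped. The sketch below, on a made-up data frame (`toy` is hypothetical), counts the rows that `na.omit()` would remove by using `complete.cases()`.

```r
# Toy data frame (hypothetical values, not the Telecommunication data)
toy <- data.frame(
  tenure = c(12, NA, 30, 45),
  churn  = c("Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)

complete_rows <- complete.cases(toy)    # TRUE where a row has no NA
rows_before   <- nrow(toy)
rows_after    <- sum(complete_rows)
rows_dropped  <- rows_before - rows_after

# na.omit() keeps the same rows as toy[complete_rows, ]
cleaned <- na.omit(toy)
```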

#3. Combining the 'region' and 'address' variables to create one comprehensive locator variable 'Address
data3 = data %>% replace(is.na(.), 'x')
View(data3)
data3 = data.frame(unite(data3, col = "Address", c('region', 'address'), sep = ", "))
#View(Address)
"The above code firstly creates a new dataframe where all the NA values have been replaced with the char

## [1] "The above code firstly creates a new dataframe where all the NA values have been replaced with t
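For comparison, a minimal base R sketch of the same idea is given below, assuming a made-up data frame (`toy` and its values are hypothetical). It builds the combined column with `paste()` and then drops the two source columns. Note that `paste()` writes a missing value as the text "NA", which is one reason to replace missing values with a placeholder first, as done above.

```r
# Toy data frame (hypothetical values, not the Telecommunication data)
toy <- data.frame(
  region  = c("Zone 1", NA, "Zone 3"),
  address = c(9, 7, NA),
  tenure  = c(13, 11, 68),
  stringsAsFactors = FALSE
)

# Build the combined locator column, then drop the two source columns
toy$Address <- paste(toy$region, toy$address, sep = ", ")
combined <- toy[, !(names(toy) %in% c("region", "address"))]
```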

#4. Subsetting the Data to focus on a specific demographic of responses


data5 = data[data$marital == "Married", ]  # select by column name instead of position 4
data5 = na.omit(data5)
View(data5)
"The above code creates a subset of the initial dataframe by scanning the 'marital' column of the datafr

## [1] "The above code creates a subset of the initial dataframe by scanning the ’marital’ column of the
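The `na.omit()` call in this step matters more than it may seem. The sketch below, on a made-up data frame (`toy` is hypothetical), shows that logical indexing returns an all-NA row wherever the condition itself is NA, whereas `subset()` treats an NA condition as FALSE. Either route followed by `na.omit()` gives the same complete rows.

```r
# Toy data frame (hypothetical values, not the Telecommunication data)
toy <- data.frame(
  marital = c("Married", "Unmarried", NA, "Married"),
  income  = c(55, 34, 80, NA),
  stringsAsFactors = FALSE
)

# Logical indexing keeps a row of NAs where marital is NA
by_index <- toy[toy$marital == "Married", ]

# subset() treats an NA condition as FALSE, so no all-NA row appears
by_subset <- subset(toy, marital == "Married")

# na.omit() afterwards removes rows with any remaining NA
married_complete <- na.omit(by_subset)
```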

#5. Removing a column from the Data due to high frequency of missing values / poor data quality
data %>% missing_plot()

[Figure: Missing values map. y-axis: the 22 variables, region to churn; x-axis: Observation, 0 to 1000]

na_sum = colSums(is.na(data))
print(na_sum)

##   region   tenure      age  marital  address   income   employ   retire
##        7        8        5        9        8        8        7        8
##   gender   reside tollfree  tollten equipten  cardten  wireten  loglong
##        9        9        8        8        8        7        8        7
##  logtoll  logequi  logcard  logwire  custcat    churn
##      525      614      322      704        8        9

"The 'logwire' column has the highest frequency of NA values hence will be the column to be removed"

## [1] "The ’logwire’ column has the highest frequency of NA values hence will be the column to be remove

data6 = data[ , colnames(data) != "logwire"]


colnames(data)

##  [1] "region"   "tenure"   "age"      "marital"  "address"  "income"
##  [7] "employ"   "retire"   "gender"   "reside"   "tollfree" "tollten"
## [13] "equipten" "cardten"  "wireten"  "loglong"  "logtoll"  "logequi"
## [19] "logcard"  "logwire"  "custcat"  "churn"

colnames(data6)

##  [1] "region"   "tenure"   "age"      "marital"  "address"  "income"
##  [7] "employ"   "retire"   "gender"   "reside"   "tollfree" "tollten"
## [13] "equipten" "cardten"  "wireten"  "loglong"  "logtoll"  "logequi"
## [19] "logcard"  "custcat"  "churn"

"The above code firstly creates a visualisation of the location of NA values throughout the dataset so a

## [1] "The above code firstly creates a visualisation of the location of NA values throughout the datas
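The column to drop can also be picked programmatically instead of being read off the printed counts. The sketch below assumes a made-up data frame (`toy` is hypothetical) and uses `which.max()` on the per-column NA counts. Expressing the counts as a share of rows is an extra step added here for illustration.

```r
# Toy data frame (hypothetical values, not the Telecommunication data)
toy <- data.frame(
  tenure  = c(1, 2, 3, 4),
  logtoll = c(NA, 3.1, NA, 2.9),
  logwire = c(NA, NA, NA, 3.5)
)

na_counts <- colSums(is.na(toy))          # NA count per column
na_share  <- na_counts / nrow(toy)        # same counts as a proportion of rows
worst_col <- names(which.max(na_counts))  # column with the most NAs

# Drop that column by name, as done for 'logwire' above
trimmed <- toy[, names(toy) != worst_col, drop = FALSE]
```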

Consider any of the newly created data in 1(a) above. Describe the data being used

" Considering 'data5' from the above question; the data is a subset of the parent Telecommunication' dat
only contains entries from respondents who are married regardless of age or income level"

## [1] " Considering ’data5’ from the above question; the data is a subset of the parent Telecommunicati

Provide appropriate descriptive summaries of any two variables using the chosen newly created data. Also provide interpretation

"Using data5"

## [1] "Using data5"

#1. Descriptive summary of 'income' variable


income = data5$income
summary(income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   15.00   34.00   55.00   88.64   80.00  591.00

#range(income)
#table(income)
histogram1 = hist(income, xlab = 'Income', ylab = '# of Respondents', main = 'Histogram of Income of Mar

[Figure: Histogram of Income of Married respondents. x-axis: Income, 0 to 600; y-axis: # of Respondents, 0 to 50]

#plot(income)
boxplot1 = boxplot(income, main = 'Boxplot of Income of Married respondents', ylab = 'Income')

[Figure: Boxplot of Income of Married respondents. y-axis: Income, 100 to 600]

boxplot1

## $stats
## [,1]
## [1,] 15
## [2,] 34
## [3,] 55
## [4,] 80
## [5,] 145
##
## $n
## [1] 61
##
## $conf
## [,1]
## [1,] 45.69428
## [2,] 64.30572
##
## $out
## [1] 163 359 301 294 228 163 591 262 256 162
##
## $group
## [1] 1 1 1 1 1 1 1 1 1 1
##
## $names
## [1] ""

histogram1

## $breaks
## [1] 0 100 200 300 400 500 600
##
## $counts
## [1] 49 5 4 2 0 1
##
## $density
## [1] 0.0080327869 0.0008196721 0.0006557377 0.0003278689 0.0000000000
## [6] 0.0001639344
##
## $mids
## [1] 50 150 250 350 450 550
##
## $xname
## [1] "income"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"

table(data5$income > 300)

##
## FALSE TRUE
## 58 3

"From the above analysis it is clear that among the Married respondents, income is not normally distribu

## [1] "From the above analysis it is clear that among the Married respondents, income is not normally d
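The outliers reported by the boxplot can be checked by hand. The sketch below reuses the figures printed above (the hinges from `boxplot1$stats`, the values in `boxplot1$out`, and the mean and median from `summary(income)`) and applies the default rule of `boxplot()`, which flags points more than 1.5 times the hinge spread beyond a hinge. The variable names are introduced here for illustration only.

```r
# Hinges reported in boxplot1$stats
lower_hinge  <- 34
upper_hinge  <- 80
hinge_spread <- upper_hinge - lower_hinge        # 46

# Default boxplot() fences: 1.5 times the hinge spread beyond each hinge
upper_fence <- upper_hinge + 1.5 * hinge_spread  # 149
lower_fence <- lower_hinge - 1.5 * hinge_spread  # -35

# Outliers listed in boxplot1$out
reported_out <- c(163, 359, 301, 294, 228, 163, 591, 262, 256, 162)
all_above_fence <- all(reported_out > upper_fence)

# A mean well above the median is a simple sign of right skew
mean_minus_median <- 88.64 - 55
```

All ten reported outliers lie above the upper fence of 149 and none lie below the lower fence, which fits the right-skewed shape seen in the histogram.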

#2. Descriptive summary of 'gender' variable


malesum = table(data5$gender)['Male']
femalesum = table(data5$gender)['Female']
malesum

## Male
## 29

femalesum

## Female
## 32

sumsum = femalesum + malesum


femaleprop = (femalesum / sumsum) * 100
maleprop = (malesum / sumsum) * 100
femaleprop

## Female
## 52.45902

maleprop

## Male
## 47.54098

"From the above analysis it is clear to see that, among the married respondents there is very little gen

## [1] "From the above analysis it is clear to see that, among the married respondents there is very lit
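The same percentages can be obtained in one step with `prop.table()`. The sketch below starts from the counts printed above (29 male, 32 female) instead of from `data5`, so that it runs on its own. The variable names are introduced here for illustration.

```r
# Counts reported by table(data5$gender) above
gender_counts <- c(Male = 29, Female = 32)

# Share of each category as a percentage of the total
gender_pct  <- prop.table(gender_counts) * 100
rounded_pct <- round(gender_pct, 2)
```

With the real data, `prop.table(table(data5$gender)) * 100` would give both percentages without computing each sum separately.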
