“PRE-PROCESSING STEPS”
STEP 1: UNDERSTANDING THE DATA:
1.VIEW THE DIMENSIONS
> class(X100_Data)
[1] "spec_tbl_df" "tbl_df" "tbl"
[4] "data.frame"
> dim(X100_Data)
[1] 100 14
dim() returns the dimensions as rows followed by columns: 100 rows and 14 columns.
> nrow(X100_Data)
[1] 100
nrow() returns the number of rows alone.
> ncol(X100_Data)
[1] 14
ncol() returns the number of columns alone.
2.VIEW THE COLUMN NAMES
> names(X100_Data)
[1] "Region" "Country" "Item Type"
[4] "Sales Channel" "Order Priority" "Order Date"
[7] "Order ID" "Ship Date" "Units Sold"
[10] "Unit Price" "Unit Cost" "Total Revenue"
[13] "Total Cost" "Total Profit"
names() lists the column headings of the dataset. There are 14 headings, one for each of the 14 columns.
3.STRUCTURE OF THE DATASET
> str(X100_Data)
spec_tbl_df [100 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Region : chr [1:100] "Australia and Oceania" "Central America and the Caribbean" "Europe" "Sub-Saharan Africa" ...
$ Country : chr [1:100] "Tuvalu" "Grenada" "Russia" "Sao Tome and Principe" ...
$ Item Type : chr [1:100] "Baby Food" "Cereal" "Office Supplies" "Fruits" ...
$ Sales Channel : chr [1:100] "Offline" "Online" "Offline" "Online" ...
$ Order Priority: chr [1:100] "H" "C" "L" "C" ...
$ Order Date : chr [1:100] "5/28/2010" "8/22/2012" "05-02-2014" "6/20/2014" ...
$ Order ID : num [1:100] 6.69e+08 9.64e+08 3.41e+08 5.14e+08 1.15e+08 ...
$ Ship Date : chr [1:100] "6/27/2010" "9/15/2012" "05-08-2014" "07-05-2014" ...
$ Units Sold : num [1:100] 9925 2804 1779 8102 5062 ...
$ Unit Price : num [1:100] 255.28 205.7 651.21 9.33 651.21 ...
$ Unit Cost : num [1:100] 159.42 117.11 524.96 6.92 524.96 ...
$ Total Revenue : num [1:100] 2533654 576783 1158503 75592 3296425 ...
$ Total Cost : num [1:100] 1582244 328376 933904 56066 2657348 ...
$ Total Profit : num [1:100] 951411 248406 224599 19526 639078 ...
- attr(*, "spec")=
.. cols(
.. Region = col_character(),
.. Country = col_character(),
.. `Item Type` = col_character(),
.. `Sales Channel` = col_character(),
.. `Order Priority` = col_character(),
.. `Order Date` = col_character(),
.. `Order ID` = col_double(),
.. `Ship Date` = col_character(),
.. `Units Sold` = col_double(),
.. `Unit Price` = col_double(),
.. `Unit Cost` = col_double(),
.. `Total Revenue` = col_double(),
.. `Total Cost` = col_double(),
.. `Total Profit` = col_double()
.. )
- attr(*, "problems")=<externalptr>
The structure of the dataset shows which columns are character and which are numeric. For
example, Unit Price is numeric (such as 255.28) and Sales Channel is character (such as
"Offline"). Order Date and Ship Date are also read as character, and their values mix two
formats ("5/28/2010" and "05-02-2014").
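Because the dates are stored as character in two layouts, they usually need converting before they can be compared or subtracted. The sketch below is one possible way to do this in base R. The helper name parse_mixed_date is hypothetical, and the code assumes that both layouts are month, day, year, which the sample values suggest but the dataset does not state.

```r
# Hypothetical helper: normalise "-" to "/" and parse as month/day/year.
# ASSUMPTION: every value is month, day, year in both layouts.
parse_mixed_date <- function(s) {
  s <- gsub("-", "/", s, fixed = TRUE)
  as.Date(s, format = "%m/%d/%Y")
}

# The four Order Date values shown in the str() output
order_date <- c("5/28/2010", "8/22/2012", "05-02-2014", "6/20/2014")
d <- parse_mixed_date(order_date)
print(d)
print(class(d))
```

Values that do not match the assumed layout come back as NA, so checking anyNA(d) after the conversion is a cheap safeguard.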
> summary(X100_Data)
Region Country
Length:100 Length:100
Class :character Class :character
Mode :character Mode :character
Item Type Sales Channel
Length:100 Length:100
Class :character Class :character
Mode :character Mode :character
Order Priority Order Date
Length:100 Length:100
Class :character Class :character
Mode :character Mode :character
Order ID Ship Date Units Sold
Min. :114606559 Length:100 Min. : 124
1st Qu.:338922488 Class :character 1st Qu.:2836
Median :557708561 Mode :character Median :5382
Mean :555020412 Mean :5129
3rd Qu.:790755081 3rd Qu.:7369
Max. :994022214 Max. :9925
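summary() gives, for each numeric column, the minimum, 1st quartile, median, mean, 3rd
quartile and maximum. For character columns it reports only the length, class and storage
mode; "Mode" here is the storage mode, not the statistical mode. A count of not available
(NA) values appears for a numeric column only when it contains any.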
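How summary() treats missing values can be checked on a small example. The two vectors below are made up for illustration and are not taken from X100_Data.

```r
# Made-up vectors: one complete, one with a single missing value
no_missing   <- c(124, 2836, 5382, 7369, 9925)
with_missing <- c(124, 2836, NA, 7369, 9925)

s_no <- summary(no_missing)
s_na <- summary(with_missing)

# The entry named "NA's" is present only for the vector that has an NA
print(names(s_no))
print(names(s_na))
```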
STEP 2: LOOKING AT THE DATA:
> head(X100_Data)
> head(X100_Data,n=15)
head() shows the first 6 rows by default. The command head(X100_Data, n=15) shows the
first 15 rows.
> tail(X100_Data)
tail() shows the last 6 rows by default.
STEP 3: VISUALIZING THE DATA:
1.Bar plot:
> Regiontable<-table(X100_Data$Region)
> Regiontable
> barplot(Regiontable,col = c("violet","Red","Green","black","Blue","Yellow","Orange"),xlab = "Region",ylab = "Count")
2.Box plot:
> par(mfrow=c(1,2))
> boxplot(X100_Data$`Total Profit`)
> par(mfrow=c(1,1))
The box plot shows outliers in Total Profit.
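The points that a box plot draws as outliers can also be listed as numbers with boxplot.stats(), which uses the same 1.5 times IQR rule by default. The sketch below uses a made-up vector rather than the Total Profit column.

```r
# Made-up values with one obvious outlier
v <- c(10, 12, 11, 13, 12, 11, 100)

# boxplot.stats() returns the values beyond the whiskers in $out
out <- boxplot.stats(v)$out
print(out)
```

Running the same call on X100_Data$`Total Profit` would list the values that appear as separate points in the plot.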
STEP 4: DEALING WITH OUTLIERS:
> X100_Data1<-X100_Data
> boxplot(X100_Data$`Total Profit`)
> boxplot(X100_Data$`Total Profit`[X100_Data$`Total Profit`<1000000])
> boxplot(X100_Data$`Total Profit`, horizontal = TRUE)
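> x<-X100_Data$`Total Profit`
> qnt<-quantile(x,probs = c(.25,.75),na.rm=TRUE)
> caps<-quantile(x,probs=c(.05,.95),na.rm=TRUE)
> H <- 1.5 * IQR(x, na.rm = TRUE)
> x[x < (qnt[1] - H)] <- caps[1]
> x[x > (qnt[2] + H)] <- caps[2]
> boxplot(x,main="Boxplot of Total Profit",horizontal=TRUE,col='Grey')
Values more than 1.5 times the interquartile range below the 1st quartile or above the 3rd
quartile are replaced by the 5th or 95th percentile. The capped values are held in x, and
X100_Data is left unchanged.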
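The capping steps above can be wrapped in a function so that they can be reused on other columns. This is a minimal sketch. The name cap_outliers and the toy vector are hypothetical, and the 5% and 95% caps simply repeat the choice made in the transcript.

```r
# Hypothetical helper: replace values beyond the 1.5 * IQR fences
# with the 5th / 95th percentile of the same vector.
cap_outliers <- function(x) {
  qnt  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  caps <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
  H    <- 1.5 * IQR(x, na.rm = TRUE)
  low  <- !is.na(x) & x < (qnt[1] - H)   # below the lower fence
  high <- !is.na(x) & x > (qnt[2] + H)   # above the upper fence
  x[low]  <- caps[1]
  x[high] <- caps[2]
  x
}

# Toy data: 1 to 19 plus one extreme value
toy    <- c(1:19, 1000)
capped <- cap_outliers(toy)
print(range(toy))
print(range(capped))
```

Note that a capped value can still lie outside the fences, because the 95th percentile is itself pulled upward by the extreme value.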
STEP 5: DEALING WITH MISSING VALUES:
> any(is.na(X100_Data[]))
[1] FALSE
> sum(is.na(X100_Data[]))
[1] 0
> colSums(is.na(X100_Data[]))
> nrow(X100_Data)
[1] 100
> nrow(X100_Data1)
[1] 100
> m<-mean(X100_Data1$`Total Profit`,na.rm=TRUE)
> m
[1] 441682
The checks above show that the dataset has no not-available (NA) values. The mean of Total
Profit, computed with NA values excluded, is the value that would be used to fill any missing
entries in that column.
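Since X100_Data has no missing values, the fill step has nothing to do here. The sketch below shows how mean imputation would work on a made-up vector that does contain NA values. The vector profit is hypothetical.

```r
# Made-up profit values with two missing entries
profit <- c(100, NA, 300, 200, NA)

# Mean of the available values only
m <- mean(profit, na.rm = TRUE)

# Replace each NA with that mean
profit[is.na(profit)] <- m
print(profit)
```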
STEP 6: SCALING THE FEATURES:
1.To display the data in vector x:
> x<-X100_Data$`Total Profit`
> x
Feature scaling can be an important pre-processing step for many machine-learning algorithms.
> scale(x)
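By default, scale() subtracts the mean from each value and then divides by the standard
deviation. The center parameter takes either a numeric-alike vector or a logical value.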
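What scale() does with its default settings can be checked by hand on a small made-up vector: subtract the mean, then divide by the standard deviation. The vector v below is an assumption for illustration only.

```r
# Made-up values
v <- c(2, 4, 6, 8)

# Manual z-score: centre on the mean, divide by the standard deviation
z_manual <- (v - mean(v)) / sd(v)

# scale() returns a one-column matrix, so convert it to a plain vector
z_scale <- as.numeric(scale(v))

print(all.equal(z_manual, z_scale))
print(mean(z_scale))
print(sd(z_scale))
```

After scaling, the values have mean 0 and standard deviation 1, up to floating-point rounding.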