21NKU14 - Preprocessing Assignment

The document outlines 6 steps for pre-processing a dataset: 1) Understanding the data by viewing dimensions, column names, and structure; 2) Looking at the data by viewing the first/last rows; 3) Visualizing the data with bar plots and box plots; 4) Dealing with outliers by capping extreme values; 5) Dealing with missing values by imputing the mean for total profit; and 6) Scaling features like total profit for machine learning algorithms.

Uploaded by

S.K. Praveen


“PRE-PROCESSING STEPS”

STEP 1: UNDERSTANDING THE DATA:

1.VIEW THE DIMENSIONS

> class(X100_Data)

[1] "spec_tbl_df" "tbl_df" "tbl"

[4] "data.frame"

> dim(X100_Data)

[1] 100 14

dim() returns the dimensions of the dataset: 100 rows and 14 columns.

> nrow(X100_Data)

[1] 100

nrow() returns only the number of rows.

> ncol(X100_Data)

[1] 14

ncol() returns only the number of columns.

2.VIEW THE COLUMN NAMES

> names(X100_Data)

[1] "Region" "Country" "Item Type"

[4] "Sales Channel" "Order Priority" "Order Date"

[7] "Order ID" "Ship Date" "Units Sold"

[10] "Unit Price" "Unit Cost" "Total Revenue"

[13] "Total Cost" "Total Profit"

names() lists the column headings of the dataset. There are 14 headings, one for each of the 14 columns.
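Several of these column names contain spaces, so they have to be wrapped in backticks when used with `$`. The sketch below illustrates this on a small made-up data frame (`toy` and its values are assumptions, not part of the assignment data) and shows one possible way to replace the spaces with underscores.

```r
# Toy data frame with non-syntactic column names (values are made up).
toy <- data.frame(`Total Profit` = c(1, 2),
                  `Units Sold`   = c(3, 4),
                  check.names = FALSE)  # keep the spaces in the names

# Names containing spaces must be quoted with backticks.
profit <- toy$`Total Profit`

# Optional: replace spaces with underscores so backticks are not needed.
names(toy) <- gsub(" ", "_", names(toy), fixed = TRUE)
print(names(toy))
```

Renaming is a matter of preference; the rest of this document keeps the original names and uses backticks.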

3.STRUCTURE OF THE DATASET

> str(X100_Data)

spec_tbl_df [100 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

$ Region : chr [1:100] "Australia and Oceania" "Central America and the Caribbean" "Europe" "Sub-Saharan Africa" ...

$ Country : chr [1:100] "Tuvalu" "Grenada" "Russia" "Sao Tome and Principe" ...

$ Item Type : chr [1:100] "Baby Food" "Cereal" "Office Supplies" "Fruits" ...

$ Sales Channel : chr [1:100] "Offline" "Online" "Offline" "Online" ...

$ Order Priority: chr [1:100] "H" "C" "L" "C" ...

$ Order Date : chr [1:100] "5/28/2010" "8/22/2012" "05-02-2014" "6/20/2014" ...

$ Order ID : num [1:100] 6.69e+08 9.64e+08 3.41e+08 5.14e+08 1.15e+08 ...

$ Ship Date : chr [1:100] "6/27/2010" "9/15/2012" "05-08-2014" "07-05-2014" ...

$ Units Sold : num [1:100] 9925 2804 1779 8102 5062 ...

$ Unit Price : num [1:100] 255.28 205.7 651.21 9.33 651.21 ...

$ Unit Cost : num [1:100] 159.42 117.11 524.96 6.92 524.96 ...

$ Total Revenue : num [1:100] 2533654 576783 1158503 75592 3296425 ...

$ Total Cost : num [1:100] 1582244 328376 933904 56066 2657348 ...

$ Total Profit : num [1:100] 951411 248406 224599 19526 639078 ...

- attr(*, "spec")=

.. cols(

.. Region = col_character(),

.. Country = col_character(),

.. `Item Type` = col_character(),

.. `Sales Channel` = col_character(),

.. `Order Priority` = col_character(),

.. `Order Date` = col_character(),

.. `Order ID` = col_double(),

.. `Ship Date` = col_character(),

.. `Units Sold` = col_double(),

.. `Unit Price` = col_double(),

.. `Unit Cost` = col_double(),

.. `Total Revenue` = col_double(),

.. `Total Cost` = col_double(),

.. `Total Profit` = col_double()

.. )

- attr(*, "problems")=<externalptr>

str() shows the type of each column, separating character columns from numeric ones. For
example, Unit Price is stored as numeric (e.g. 255.28) and Sales Channel is stored as
character (e.g. "Offline"). Order Date and Ship Date are also stored as character, not as dates.
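The str() output shows the date columns stored as character in two layouts ("5/28/2010" and "05-02-2014"). A minimal sketch of converting such values to the Date class follows. It assumes both layouts mean month, day, year, which the output alone does not confirm; `parse_mixed_date` is a hypothetical helper name.

```r
# Hypothetical helper: parse dates written as m/d/Y or m-d-Y.
# Assumption: both layouts are month-day-year.
parse_mixed_date <- function(x) {
  # Normalise the separator, then parse with a single format.
  as.Date(gsub("-", "/", x, fixed = TRUE), format = "%m/%d/%Y")
}

dates <- parse_mixed_date(c("5/28/2010", "05-02-2014"))
print(dates)
```

Values that do not match the assumed layout come back as NA, so checking `sum(is.na(dates))` after conversion is a reasonable sanity test.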

> summary(X100_Data)

Region Country

Length:100 Length:100

Class :character Class :character

Mode :character Mode :character

Item Type Sales Channel

Length:100 Length:100

Class :character Class :character

Mode :character Mode :character

Order Priority Order Date

Length:100 Length:100
Class :character Class :character

Mode :character Mode :character

Order ID Ship Date Units Sold

Min. :114606559 Length:100 Min. : 124

1st Qu.:338922488 Class :character 1st Qu.:2836

Median :557708561 Mode :character Median :5382

Mean :555020412 Mean :5129

3rd Qu.:790755081 3rd Qu.:7369

Max. :994022214 Max. :9925

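summary() reports, for each numeric column, the minimum, 1st quartile, median, mean, 3rd quartile
and maximum, plus a count of NA values when any are present. For character columns it reports only
the length, class and storage mode; the "Mode" row is not the statistical mode.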
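To see how summary() reports missing values, the sketch below runs it on a short made-up numeric vector containing one NA (the vector `profit` is an assumption for illustration only).

```r
# Made-up values with one missing entry.
profit <- c(951411, 248406, NA, 19526, 639078)

s <- summary(profit)
print(s)         # includes an "NA's" entry because a value is missing
print(names(s))
```

When a numeric column has no missing values, the "NA's" entry is simply absent from the output.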

STEP 2: LOOKING AT THE DATA:

> head(X100_Data)

> head(X100_Data,n=15)

head() shows the first 6 rows by default. head(X100_Data, n=15) shows the first 15 rows.

> tail(X100_Data)

It shows the bottom 6 rows by default.

STEP 3: VISUALIZING THE DATA:

1.Bar plot:

> Regiontable<-table(X100_Data$Region)

> Regiontable

> barplot(Regiontable,col = c("violet","Red","Green","black","Blue","Yellow","Orange"),xlab = "Region",ylab = "Number of orders")

2.Box plot:

> par(mfrow=c(1,2))

> boxplot(X100_Data$`Total Profit`)

> par(mfrow=c(1,1))

The box plot shows outliers in Total Profit.
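Besides reading outliers off the plot, the values that a box plot draws as individual points can be listed with boxplot.stats(). The sketch below uses a small made-up vector rather than the assignment data.

```r
# Made-up values; 100 lies far from the rest.
x <- c(10, 12, 11, 13, 12, 100)

# boxplot.stats() returns the same statistics that boxplot() draws;
# $out holds the points beyond 1.5 times the box length.
out <- boxplot.stats(x)$out
print(out)
```

Applied to the dataset, the same call would be boxplot.stats(X100_Data$`Total Profit`)$out.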

STEP 4: DEALING WITH OUTLIERS:

> X100_Data1<-X100_Data

> boxplot(X100_Data$`Total Profit`)

> boxplot(X100_Data$`Total Profit`[X100_Data$`Total Profit`<1000000])

> boxplot(X100_Data$`Total Profit`, horizontal = TRUE)

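> # Work on a copy of the column; attach() is not needed

> x<-X100_Data$`Total Profit`

> qnt<-quantile(x,probs = c(.25,.75),na.rm=TRUE)

> caps<-quantile(x,probs=c(.05,.95),na.rm=TRUE)

> H <- 1.5 * IQR(x, na.rm = TRUE)

> # Values beyond 1.5*IQR from the quartiles are replaced by the 5th/95th percentiles

> x[x < (qnt[1] - H)] <- caps[1]

> x[x > (qnt[2] + H)] <- caps[2]

> # The capped values are kept in a separate vector; the data frame column is unchanged

> `Total Profit`<-x

> boxplot(`Total Profit`,main="Boxplot of Total Profit",horizontal=TRUE,col='Grey')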
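The capping steps above can be wrapped in a function that returns the capped vector, so the result can be assigned back to a data frame column. This is a minimal sketch: `cap_outliers`, `toy` and the column names are hypothetical, and the data are made up.

```r
# Hypothetical helper: cap values lying more than 1.5 * IQR beyond the
# quartiles at the given lower/upper percentiles.
cap_outliers <- function(x, lower = 0.05, upper = 0.95) {
  qnt  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  caps <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  h    <- 1.5 * IQR(x, na.rm = TRUE)
  low  <- !is.na(x) & x < qnt[1] - h   # NA entries are left untouched
  high <- !is.na(x) & x > qnt[2] + h
  x[low]  <- caps[1]
  x[high] <- caps[2]
  x
}

# Made-up data with one extreme value.
toy <- data.frame(profit = c(10, 12, 11, 13, 12, 14, 11, 13, 12, 500))
toy$profit_capped <- cap_outliers(toy$profit)
print(toy)
```

Writing the result to a new column keeps the original values available for comparison.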

STEP 5: DEALING WITH MISSING VALUES:

> any(is.na(X100_Data))

[1] FALSE

> sum(is.na(X100_Data))

[1] 0

> colSums(is.na(X100_Data))

> nrow(X100_Data)

[1] 100

> nrow(X100_Data1)

[1] 100

> m<-mean(X100_Data1$`Total Profit`,na.rm=TRUE)

> m

[1] 441682

The checks above show that the dataset contains no missing (NA) values, so nothing has to be
imputed here. The mean of Total Profit (441682) is the value that would replace missing entries
in that column if any were present.
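For a column that does contain missing values, mean imputation could be sketched as follows. `impute_mean` is a hypothetical helper and the vector `profit` is made up for illustration.

```r
# Hypothetical helper: replace NA entries with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Made-up values with two missing entries.
profit <- c(100, NA, 300, 200, NA)
filled <- impute_mean(profit)
print(filled)
```

Mean imputation keeps the column mean unchanged but reduces its variance, so it is best treated as a simple baseline.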

STEP 6: SCALING THE FEATURES:

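1.To display the data in vector x:

> x<-X100_Data$`Total Profit`

> x

Feature scaling can be an important pre-processing step for many machine-learning algorithms.

> scale(x)

By default, scale() subtracts the mean from each value and divides by the standard deviation. The center argument takes either a logical value or a numeric-alike vector whose length equals the number of columns of x.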
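As a check on what scale() computes with its default arguments, the sketch below compares it with a manual z-score on a small made-up vector and also shows min-max scaling as an alternative. The variable names are assumptions for illustration.

```r
# Made-up values.
x <- c(2, 4, 6, 8)

# Manual standardisation: subtract the mean, divide by the standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; convert it to a plain vector.
z_scale <- as.numeric(scale(x))
print(all.equal(z_manual, z_scale))

# Alternative: min-max scaling to the range [0, 1].
minmax <- (x - min(x)) / (max(x) - min(x))
print(minmax)
```

Which scaling to use depends on the algorithm; standardisation is a common default, while min-max scaling is convenient when a fixed range is needed.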
