A Complete Tutorial To Learn Data Science in R From Scratch

A basic guide to R

Uploaded by

meegoos
Copyright
© All Rights Reserved
20/11/2016

ACompleteTutorialtolearnDataScienceinRfromScratch

A Complete Tutorial to learn Data Science in R from Scratch

Introduction

https://www.analyticsvidhya.com/blog/2016/02/completetutoriallearndatasciencescratch/#two

R is a powerful language used widely for data analysis and statistical computing. It was developed in
the early 1990s. Since then, considerable effort has gone into improving R's user interface. The journey of the R
language from a rudimentary text editor to the interactive RStudio and, more recently, Jupyter Notebooks
(http://discuss.analyticsvidhya.com/t/how-to-run-r-on-jupyter-ipython-notebooks/5512)
has engaged many data science communities across the world.

This was possible only because of generous contributions by R users globally. The inclusion of powerful
packages has made R more capable over time. Packages such as dplyr, tidyr, readr,
data.table, SparkR and ggplot2 have made data manipulation, visualization and computation much faster.
But what about machine learning?
My first impression of R was that it's just a software for statistical computing. Good thing, I was wrong!
R has enough provisions to implement machine learning algorithms in a fast and simple manner.
This is a complete tutorial to learn data science and machine learning using R. By the end of this
tutorial, you will have good exposure to building predictive models using machine learning on your
own.
Note: No prior knowledge of data science / analytics is required. However, prior knowledge of
algebra and statistics will be helpful.

Table of Contents
1. Basics of R Programming for Data Science
   Why learn R?
   How to install R / RStudio?
   How to install R packages?
   Basic computations in R
2. Essentials of R Programming
   Data Types and Objects in R
   Control Structures (Functions) in R
   Useful R Packages
3. Exploratory Data Analysis in R
   Basic Graphs
   Treating Missing values
   Working with Continuous and Categorical Variables
4. Data Manipulation in R
   Feature Engineering
   Label Encoding / One Hot Encoding
5. Predictive Modeling using Machine Learning in R
   Linear Regression
   Decision Tree
   Random Forest

Let's get started!

Note: The data set used in this article is from Big Mart Sales Prediction
(http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii).

1. Basics of R Programming
Why learn R ?
I don't know if I have a solid reason to convince you, but let me share what got me started. I have no
prior coding experience. Actually, I never had computer science among my subjects. I came to know that
to learn data science, one must learn either R or Python as a starter. I chose the former. Here are
some benefits I found after using R:

1. The style of coding is quite easy.
2. It's open source. No need to pay any subscription charges.
3. Instant access to over 7800 packages customized for various computation tasks.
4. The community support is overwhelming. There are numerous forums to help you out.
5. You get a high performance computing experience (requires packages).
6. It is one of the most highly sought skills by analytics and data science companies.

There are many more benefits. But these are the ones which have kept me going. If you think they
are exciting, stick around and move to the next section. And if you aren't convinced, you may like the
Complete Python Tutorial from Scratch (https://www.analyticsvidhya.com/blog/2016/01/completetutorial-learn-data-science-python-scratch-2/).

How to install R / RStudio?

You could download and install the old version (http://ftp.heanet.ie/mirrors/cran.r-project.org/) of R.
But I'd insist that you start with RStudio. It provides a much better coding experience. Note that RStudio is an
interface to R, so R itself must also be installed. For Windows users,
RStudio is available for Windows Vista and later versions. Follow the steps below to install
RStudio:
1. Go to https://www.rstudio.com/products/rstudio/download/
2. In the 'Installers for Supported Platforms' section, choose and click the RStudio installer for your
operating system. The download should begin as soon as you click.
3. Click Next..Next..Finish.
4. Download Complete.
5. To start RStudio, click on its desktop icon or use Windows search to access the program. It looks like
this:

Let's quickly understand the interface of RStudio:

1. R Console: This area shows the output of the code you run. You can also write code directly in the console.
Code entered directly in the R console cannot be traced later. This is where the R script comes into use.
2. R Script: As the name suggests, here you get space to write code. To run it, simply select
the line(s) of code and press Ctrl + Enter. Alternatively, you can click the little 'Run' button located at the top
right corner of the R Script pane.
3. R Environment: This space displays the set of external elements added. This includes data sets,
variables, vectors, functions etc. To check if data has been loaded properly in R, always look at this
area.
4. Graphical Output: This space displays the graphs created during exploratory data analysis. Besides
graphs, you can select packages and seek help from R's embedded official documentation.

How to install R Packages ?

The sheer power of R lies in its incredible packages. In R, most data handling tasks can be performed
in 2 ways: using R packages or using base R functions. In this tutorial, I'll also introduce you to the most
handy and powerful R packages. To install a package, simply type:
install.packages("packagename")

As a first-time user, a pop-up might appear asking you to select your CRAN mirror (country server). Choose
accordingly and press OK.

Note: You can type this either directly in the console and press Enter, or in an R script and click 'Run'.
Once installed, a package is loaded into a session with library(packagename).
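
If you install many packages, it can help to check first which ones are already present. Here is a minimal sketch of that idea. The helper name missing_packages is my own (hypothetical), and the two packages listed are base packages chosen only so the sketch runs anywhere:

```r
# Hypothetical helper: report which of the requested packages are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("stats", "utils")   # base packages, used so the sketch runs anywhere
to_install <- missing_packages(wanted)

# Install only what is missing, then attach a package with library()
if (length(to_install) > 0) {
  install.packages(to_install)
}
library(stats)
```

Replace the names in wanted with the packages you actually need, e.g. "ggplot2".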

Basic Computations in R
Let's begin with the basics. To get familiar with the R coding environment, start with some basic calculations.
The R console can be used as an interactive calculator too. Type the following in your console:
> 2 + 3
[1] 5
> 6 / 3
[1] 2
> (3 * 8) / (2 * 3)
[1] 4
> log(12)
[1] 2.484907
> sqrt(121)
[1] 11

Similarly, you can experiment with various combinations of calculations and get the results. In case you
want to recall a previous calculation, this can be done in two ways. First, click in the R console and
press the Up / Down arrow keys on your keyboard. This cycles through the previously executed commands.
Press Enter to run the selected command again.
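
One detail worth checking while experimenting with the calculations above is which base log() uses. The sketch below (values chosen arbitrarily) shows that log() is the natural logarithm by default, and that other bases are available through log10(), log2() or the base argument:

```r
# log() is the natural logarithm (base e) unless a base is given
natural <- log(12)            # about 2.4849
base10  <- log10(12)          # about 1.0792
base2   <- log2(8)            # 3
custom  <- log(81, base = 3)  # 4 (up to floating point rounding)

# exp() is the inverse of the natural logarithm
round_trip <- exp(log(12))

print(c(natural = natural, base10 = base10, base2 = base2, custom = custom))
```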

But what if you have done too many calculations? It would be too painful to scroll through every
command to find the one you need. In such situations, the second way, creating a variable, is helpful.
In R, you can create a variable using the <- or = sign. Let's say I want to create a variable x to compute the
sum of 7 and 8. I'll write it as:
> x <- 8 + 7
> x
[1] 15

Once we create a variable, we no longer get the output directly (like a calculator) unless we call the
variable in the next line. Remember, variable names can be alphabetic or alphanumeric, but not purely numeric,
and a name cannot begin with a digit.
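
To make the naming rule concrete, here is a small sketch. The variable names are my own, picked only for illustration. make.names() is a base R function that shows how R would repair a name that is not valid:

```r
# Assign with <- (or =); the value is printed only when the name is evaluated
x <- 8 + 7
x

# Names may mix letters, digits, dots and underscores,
# but a syntactic name cannot begin with a digit
total_2016 <- x * 2

# make.names() shows how R would repair an invalid name
fixed <- make.names(c("total_2016", "2016_total"))
print(fixed)
```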

2. Essentials of R Programming

Understand and practice this section thoroughly. This is the building block of your R programming
knowledge. If you get this right, you will face less trouble in debugging.
R has five basic or 'atomic' classes of objects. Wait, what is an object?
Everything you see or create in R is an object. A vector, matrix, data frame, even a variable is an
object. R treats it that way. So, R has 5 basic classes of objects:
1. Character
2. Numeric (Real Numbers)
3. Integer (Whole Numbers)
4. Complex
5. Logical (True / False)

Since these classes are self-explanatory by name, I won't elaborate on them. Objects of these classes have
attributes. Think of attributes as their 'identifier', a name or number which aptly identifies them. An
object can have the following attributes:
1. names, dimension names
2. dimensions
3. class
4. length

Attributes of an object can be accessed using the attributes() function. More on this in the following
section.
Let's understand the concept of objects and attributes practically. The most basic object in R is known
as a vector. You can create an empty vector using vector(). Remember, a vector contains objects of the same
class.
For example, let's create vectors of different classes. We can also create a vector using the c() or concatenate
command.
> a <- c(1.8, 4.5)        # numeric
> b <- c(1+2i, 3-6i)      # complex
> d <- c(23L, 44L)        # integer (the L suffix marks whole numbers as integers)
> e <- vector("logical", length = 5)

Similarly, you can create vectors of various classes.
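
A quick way to see classes and attributes side by side is sketched below. The named vector scores is made up for this illustration:

```r
a <- c(1.8, 4.5)                     # numeric (double)
d <- c(23L, 44L)                     # the L suffix makes these integers
e <- vector("logical", length = 5)   # five FALSE values

class(a)
class(d)
class(e)

# A plain vector has no attributes until we add some, e.g. names
attributes(a)                        # NULL
scores <- c(ash = 67, jane = 56)
attributes(scores)                   # $names
length(scores)
```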

Data Types in R
R has various data types, which include vectors (numeric, integer etc.), matrices, data frames
and lists. Let's understand them one by one.
Vector: As mentioned above, a vector contains objects of the same class. But you can mix objects of
different classes too. When objects of different classes are mixed in a vector, coercion occurs. This
causes the objects of different types to be converted into one class. For example:
> qt <- c("Time", 24, "October", TRUE, 3.33)   # character
> ab <- c(TRUE, 24)                            # numeric
> cd <- c(2.5, "May")                          # character

To check the class of any object, use the class(object name) function.

> class(qt)
[1] "character"

To convert the class of a vector, you can use the as. family of commands. The result must be assigned back
for the change to persist.

> bar <- 0:5
> class(bar)
[1] "integer"
> bar <- as.numeric(bar)
> class(bar)
[1] "numeric"
> bar <- as.character(bar)
> class(bar)
[1] "character"

Similarly, you can change the class of any vector. But you should pay attention here. If you try to
convert a character vector to numeric, NAs will be introduced for values that cannot be read as numbers.
Hence, you should be careful when you use this command.
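
The two ideas above, implicit coercion and explicit conversion, can be tried together. This is only a sketch with made-up values. The coercion order in the first comment is how base R resolves mixed classes in c():

```r
# Implicit coercion when classes are mixed in c():
# logical < integer < numeric < complex < character
class(c(TRUE, 2L))      # "integer"
class(c(TRUE, 24))      # "numeric"
class(c(2.5, "May"))    # "character"

# Explicit conversion with as.*(); assign the result to keep it
bar <- 0:5
bar_chr <- as.character(bar)
class(bar_chr)          # "character"

# Text that is not a number becomes NA (R warns "NAs introduced by coercion");
# suppressWarnings() is used here only to keep the output quiet
mixed <- c("12", "3.5", "abc")
converted <- suppressWarnings(as.numeric(mixed))
print(converted)
sum(is.na(converted))   # number of values that failed to convert
```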

List: A list is a special type of vector which can contain elements of different data types. For example:
> my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 22

[[2]]
[1] "ab"

[[3]]
[1] TRUE

[[4]]
[1] 1+2i

As you can see, the output of a list is different from that of a vector. This is because the objects are of
different types. The double bracket [[1]] shows the index of the first element and so on. Hence, you can
easily extract an element of a list by its index. Like this:
> my_list[[3]]
[1] TRUE

You can use single brackets [] too. But that returns a list of length one containing the element, instead
of the element itself. Like this:
> my_list[3]
[[1]]
[1] TRUE
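
The difference between the two bracket styles is easiest to see by checking the class of what comes back. A short sketch follows. The named list person is my own addition for illustration:

```r
my_list <- list(22, "ab", TRUE, 1 + 2i)

element <- my_list[[3]]   # the element itself
sublist <- my_list[3]     # a list of length 1 that holds the element

class(element)            # "logical"
class(sublist)            # "list"

# Named lists can also be indexed with $ or [["name"]]
person <- list(name = "ash", score = 67)
person$score
person[["name"]]
```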

Matrices: When a vector is given rows and columns, i.e. a dimension attribute, it becomes a
matrix. A matrix is represented by a set of rows and columns. It is a 2-dimensional data structure. It
consists of elements of the same class. Let's create a matrix of 3 rows and 2 columns:
> my_matrix <- matrix(1:6, nrow = 3, ncol = 2)
> my_matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

> dim(my_matrix)
[1] 3 2
> attributes(my_matrix)
$dim
[1] 3 2

As you can see, the dimensions of a matrix can be obtained using either the dim() or the attributes()
command. To extract a particular element from a matrix, simply use the index shown above. For
example (try this at your end):
> my_matrix[, 2]   # extracts second column
> my_matrix[, 1]   # extracts first column
> my_matrix[2, ]   # extracts second row
> my_matrix[1, ]   # extracts first row

As an interesting fact, you can also create a matrix from a vector. All you need to do is assign
dimensions with dim() afterwards. Like this:
> age <- c(23, 44, 15, 12, 31, 16)
> age
[1] 23 44 15 12 31 16
> dim(age) <- c(2, 3)
> age
     [,1] [,2] [,3]
[1,]   23   15   31
[2,]   44   12   16
> class(age)
[1] "matrix"

(In R 4.0 and later, class() of a matrix returns "matrix" "array".)

You can also join two vectors using the cbind() and rbind() functions. But make sure that both vectors
have the same number of elements. If not, R recycles the shorter vector to fill the gap, and warns when
the longer length is not a multiple of the shorter one.
> x <- c(1, 2, 3, 4, 5, 6)
> y <- c(20, 30, 40, 50, 60, 70)
> cbind(x, y)
     x  y
[1,] 1 20
[2,] 2 30
[3,] 3 40
[4,] 4 50
[5,] 5 60
[6,] 6 70
> class(cbind(x, y))
[1] "matrix"
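
Here is a small sketch that pulls the matrix operations together: picking a single cell, keeping a column as a matrix, and binding vectors by row. The vector z is made up to show what recycling looks like:

```r
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)

my_matrix[2, 2]          # single element: row 2, column 2
my_matrix[c(1, 3), ]     # rows 1 and 3, all columns

# drop = FALSE keeps a one-column result as a matrix
second_col <- my_matrix[, 2, drop = FALSE]
dim(second_col)

# rbind() stacks vectors as rows
x <- c(1, 2, 3, 4, 5, 6)
y <- c(20, 30, 40, 50, 60, 70)
by_rows <- rbind(x, y)
dim(by_rows)

# Unequal lengths: the shorter vector is recycled. 6 is a multiple of 2,
# so this is silent; otherwise R recycles and also warns.
z <- c(0, 1)
recycled <- cbind(x, z)
recycled[, "z"]
```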

Data Frame: This is the most commonly used member of the data types family. It is used to store tabular
data. It is different from a matrix. In a matrix, every element must have the same class. But a data frame
is a list of equal-length vectors, and each vector (column) can have a different class.
When you read tabular data into R, it is typically stored in the form of a data frame. Hence, it is
important to understand the most commonly used commands on data frames:
> df <- data.frame(name = c("ash", "jane", "paul", "mark"), score = c(67, 56, 87, 91), stringsAsFactors = TRUE)
> df
  name score
1  ash    67
2 jane    56
3 paul    87
4 mark    91
> dim(df)
[1] 4 2
> str(df)
'data.frame': 4 obs. of  2 variables:
 $ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
 $ score: num  67 56 87 91
> nrow(df)
[1] 4
> ncol(df)
[1] 2

Let's understand the code above. df is the name of the data frame. stringsAsFactors = TRUE stores the text
column as a factor (this was the default before R 4.0). dim() returns the dimensions of the data
frame as 4 rows and 2 columns. str() returns the structure of a data frame, i.e. the list of variables
stored in it along with their types. nrow() and ncol() return the number of rows and number of columns in a
data set respectively.
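
Beyond these commands, you will often need to pull out columns or rows. A minimal sketch on the same toy data follows. The passed column is my own addition for illustration:

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(67, 56, 87, 91))

df$score                    # one column as a vector
df[2, ]                     # second row, all columns
df[df$score > 60, "name"]   # names of everyone scoring above 60

df$passed <- df$score >= 60 # add a new column
head(df, 2)                 # first two rows
summary(df$score)           # min, quartiles, mean, max
```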


Here you see that name is a factor variable and score is numeric. In data science, a variable can be
categorized into two types: continuous and categorical.
Continuous variables are those which can take any numeric value, such as 1, 2, 3.5, 4.66 etc. Categorical
variables are those which take only a limited set of discrete values (categories), such as 2, 5, 11, 15 etc. In R, categorical values
are represented by factors. In df, name is a factor variable having 4 unique levels. Factor or categorical
variables are specially treated in a data set. For more explanation, click here
(https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variablespredictive-modeling/). Similarly, you can
find techniques to deal with continuous variables here
(https://www.analyticsvidhya.com/blog/2015/11/8-ways-deal-continuous-variables-predictivemodeling/).
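
To get a feel for how factors store categories, here is a short sketch. The outlet_size values are made up, loosely modelled on the kind of labels used later in this tutorial:

```r
outlet_size <- factor(c("Small", "High", "Medium", "Small", "Medium"))

levels(outlet_size)      # distinct categories, sorted alphabetically
nlevels(outlet_size)     # number of categories
table(outlet_size)       # count of observations per level
as.integer(outlet_size)  # the integer codes stored internally

# An explicit level order is useful for ordered categories
ordered_size <- factor(outlet_size,
                       levels = c("Small", "Medium", "High"),
                       ordered = TRUE)
ordered_size < "High"
```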
Let's now understand the concept of missing values in R. This is one of the most painful yet crucial
parts of predictive modeling. You must be aware of all techniques to deal with them. The complete
explanation of such techniques is provided here
(https://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-buildingmodel-part-2/).
Missing values in R are represented by NA and NaN. Now we'll check if a data set has missing values
(using the same data frame df).
> df[1:2, 2] <- NA   # injecting NA at 1st, 2nd row and 2nd column of df
> df
  name score
1  ash    NA
2 jane    NA
3 paul    87
4 mark    91
> is.na(df)   # checks the entire data set for NAs and returns logical output
      name score
[1,] FALSE  TRUE
[2,] FALSE  TRUE
[3,] FALSE FALSE
[4,] FALSE FALSE
> table(is.na(df))   # returns a table of logical output

FALSE  TRUE
    6     2
> df[!complete.cases(df), ]   # returns the rows having missing values
  name score
1  ash    NA
2 jane    NA

Missing values hinder normal calculations in a data set. For example, let's say we want to compute
the mean of score. Since there are two missing values, it can't be done directly. Let's see:
> mean(df$score)
[1] NA
> mean(df$score, na.rm = TRUE)
[1] 89

The na.rm = TRUE parameter tells R to ignore the NAs and compute the mean of the remaining
values in the selected column (score). To remove rows with NA values in a data frame, you can use
na.omit:
> new_df <- na.omit(df)
> new_df
  name score
3 paul    87
4 mark    91
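
Dropping rows is not the only option. One common alternative is to fill the gaps with the mean of the observed values. The sketch below illustrates that idea on the same toy data. It is not presented as the approach this tutorial settles on:

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(NA, NA, 87, 91))

colSums(is.na(df))                     # NA count per column

# Replace missing scores with the mean of the observed ones
observed_mean <- mean(df$score, na.rm = TRUE)
imputed <- df
imputed$score[is.na(imputed$score)] <- observed_mean
print(imputed)
```

Unlike na.omit(), this keeps all four rows, at the cost of replacing unknown values with an estimate.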

Control Structures in R
As the name suggests, a control structure 'controls' the flow of code / commands written inside a
function. A function is a set of multiple commands written to automate a repetitive coding task.
For example: You have 10 data sets. You want to find the mean of the 'Age' column present in every data
set. This can be done in 2 ways: either you write the code to compute the mean 10 times, or you simply
create a function and pass each data set to it.
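
The second approach can be sketched as follows. The function name mean_age and the three small data sets are made up. They stand in for the ten data sets mentioned above:

```r
# Hypothetical helper: mean of the Age column, ignoring NAs
mean_age <- function(data) {
  mean(data$Age, na.rm = TRUE)
}

# Three small made-up data sets stand in for the ten in the text
data_sets <- list(
  first  = data.frame(Age = c(21, 35, 40)),
  second = data.frame(Age = c(18, NA, 30)),
  third  = data.frame(Age = c(50, 60))
)

# Apply the function to every data set in one call
age_means <- sapply(data_sets, mean_age)
print(age_means)
```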
Let's understand the control structures in R with simple examples:
if, else: This structure is used to test a condition. Below is the syntax:

if (<condition>) {
  ## do something
} else {
  ## do something else
}

Example
# initialize a variable
N <- 10
# check if this variable * 5 is > 40
if (N * 5 > 40) {
  print("This is easy!")
} else {
  print("It's not easy!")
}
[1] "This is easy!"

for: This structure is used when a loop is to be executed a fixed number of times. It is commonly used
for iterating over the elements of an object (list, vector). Below is the syntax:
for (<search condition>) {
  # do something
}

Example
# initialize a vector
y <- c(99, 45, 34, 65, 76, 23)
# print the first 4 numbers of this vector
for (i in 1:4) {
  print(y[i])
}
[1] 99
[1] 45
[1] 34
[1] 65

while: It begins by testing a condition, and executes only if the condition is found to be true. Once
the loop is executed, the condition is tested again. Hence, it's necessary to alter the condition such
that the loop doesn't run forever. Below is an example:
# initialize a condition
Age <- 12
# check if age is less than 17
while (Age < 17) {
  print(Age)
  Age <- Age + 1   # increment Age so that the condition eventually fails and the loop ends
}
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16

There are other control structures as well, but they are used less frequently than those explained above. Those
structures are:
1. repeat: It executes an infinite loop
2. break: It breaks the execution of a loop
3. next: It allows skipping an iteration in a loop
4. return: It helps to exit a function
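
These four are easier to grasp in action. The sketch below uses a made-up helper, first_above, to show next and return, and a small counter to show repeat and break:

```r
# Hypothetical helper: return the first value above a threshold
first_above <- function(values, threshold) {
  for (v in values) {
    if (is.na(v)) next              # skip this iteration
    if (v > threshold) return(v)    # exit the function immediately
  }
  NA                                # nothing qualified
}

# repeat loops forever unless break is called
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}

found <- first_above(c(12, NA, 45, 99), 40)
print(found)
print(count)
```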

Note: If you find the section on control structures difficult to understand, not to worry. R is supported by
various packages that complement the work done by control structures.

Useful R Packages

Out of ~7800 packages listed on CRAN (https://cran.r-project.org/), I've listed some of the most
powerful and commonly used packages in predictive modeling in this article. Since I've already
explained the method of installing packages, you can go ahead and install them now. Sooner or later
you'll need them.
Importing Data: R offers a wide range of packages for importing data available in any format such as
.txt, .csv, .json, .sql etc. To import large files of data quickly, it is advisable to install and use data.table,
readr, RMySQL, sqldf, jsonlite.
Data Visualization: R has built-in plotting commands as well. They are good for creating simple graphs
but become complex when it comes to creating advanced graphics. Hence, you should install
ggplot2.
Data Manipulation: R has a fantastic collection of packages for data manipulation. These packages
allow you to do basic & advanced computations quickly. These packages are dplyr, plyr, tidyr,
lubridate, stringr. Check out this complete tutorial
(https://www.analyticsvidhya.com/blog/2015/12/faster-data-manipulation-7-packages/)
on data manipulation packages in R.
Modeling / Machine Learning: For modeling, the caret package in R is powerful enough to cater to
most needs for creating machine learning models. However, you can also install packages algorithm-wise,
such as randomForest, rpart, gbm etc.

Note: I've only mentioned the commonly used packages. You might like to check this interesting
infographic (https://www.analyticsvidhya.com/blog/2015/08/list-r-packages-data-analysis/)
on the complete list of useful R packages.

Up to this point, you have become familiar with the basic work style in R and its associated components. From the
next section, we'll begin with predictive modeling. But before you proceed, I want you to practice
what you've learnt so far.
Practice Assignment: As a part of this assignment, install the 'swirl' package in R. Then type
library(swirl) to load the package, run swirl() to start it, and complete its interactive R tutorial. If you have followed this
article thoroughly, this assignment should be an easy task for you!

3. Exploratory Data Analysis in R


From this section onwards, we'll dive deep into various stages of predictive modeling. Hence, make
sure you understand every aspect of this section. In case you find anything difficult to understand, ask
me in the comments section below.
Data exploration is a crucial stage of predictive modeling. You can't build great and practical models
unless you learn to explore the data from beginning to end. This stage forms a concrete foundation for
data manipulation (the very next stage). Let's understand it in R.
In this tutorial, I've taken the data set from Big Mart Sales Prediction
(https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/).
Before we start, you must get familiar with these terms:

Response Variable (a.k.a. Dependent Variable): In a data set, the response variable (y) is the one on
which we make predictions. In this case, we'll predict 'Item_Outlet_Sales'. (Refer to the image shown
below.)

Predictor Variable (a.k.a. Independent Variable): In a data set, predictor variables (Xi) are those using
which the prediction is made on the response variable. (Image below.)

[Image: predictor and response variables in the train data set (https://www.analyticsvidhya.com/wp-content/uploads/2016/02/PRV.png)]

Train Data: The predictive model is always built on the train data set. An intuitive way to identify the train
data is that it always has the response variable included.
Test Data: Once the model is built, its accuracy is tested on the test data. Here, this data contains fewer
observations than the train data set. Also, it does not include the response variable.
Right now, you should download the data set. Take a good look at the train and test data. Cross-check the
information shared above and then proceed.

Let's now begin with importing and exploring data.

# working directory
path <- ".../Data/BigMartSales"
# set working directory
setwd(path)

As a beginner, I'll advise you to keep the train and test files in your working directory to avoid
unnecessary directory troubles. Once the directory is set, we can easily import the .csv files using the
commands below.
# Load datasets (stringsAsFactors = TRUE keeps text columns as factors, the default before R 4.0)
train <- read.csv("Train_UWu5bXk.csv", stringsAsFactors = TRUE)
test <- read.csv("Test_u94Q5KV.csv", stringsAsFactors = TRUE)
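
If you want to try read.csv() before downloading anything, the sketch below writes a tiny made-up table to a temporary file and reads it back. The column names imitate the Big Mart data, but the values are invented for this illustration:

```r
# Write a tiny made-up table to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(Item_Weight = c(9.3, NA, 17.5),
                  Outlet_Size = c("Medium", "", "High"))
write.csv(toy, csv_path, row.names = FALSE)

# Check that the file exists before reading it
stopifnot(file.exists(csv_path))

# stringsAsFactors = TRUE stores text columns as factors
# (this was the default before R 4.0)
loaded <- read.csv(csv_path, stringsAsFactors = TRUE)
str(loaded)
levels(loaded$Outlet_Size)   # note the blank "" level
```

The blank level appears because an empty text cell is read as an empty string rather than as NA.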

In fact, even prior to loading data in R, it's a good practice to look at the data in Excel. This helps in
strategizing the complete prediction modeling process. To check if the data set has been loaded
successfully, look at the R environment pane. The data can be seen there. Let's explore the data quickly.
# check dimensions (number of rows & columns) in data set
> dim(train)
[1] 8523   12
> dim(test)
[1] 5681   11

We have 8523 rows and 12 columns in the train data set and 5681 rows and 11 columns in the test data set. This
makes sense. Test data should always have one column less (mentioned above, right?). Let's get
deeper into the train data set now.
# check the variables and their types in train
> str(train)
'data.frame': 8523 obs. of  12 variables:
 $ Item_Identifier          : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
 $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
 $ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
 $ Item_Visibility          : num  0.016 0.0193 0.0168 0 0 ...
 $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
 $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
 $ Outlet_Identifier        : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
 $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
 $ Outlet_Size              : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
 $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
 $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
 $ Item_Outlet_Sales        : num  3735 443 2097 732 995 ...

Let's do some quick data exploration.

To begin with, I'll first check if this data has missing values. This can be done by using:
> table(is.na(train))

 FALSE   TRUE
100813   1463

In the train data set, we have 1463 missing values. Let's check the variables in which these values are
missing. It's important to find and locate these missing values. Many data scientists have repeatedly
advised beginners to pay close attention to missing values in the data exploration stage.
> colSums(is.na(train))
          Item_Identifier               Item_Weight
                        0                      1463
         Item_Fat_Content           Item_Visibility
                        0                         0
                Item_Type                  Item_MRP
                        0                         0
        Outlet_Identifier Outlet_Establishment_Year
                        0                         0
              Outlet_Size      Outlet_Location_Type
                        0                         0
              Outlet_Type         Item_Outlet_Sales
                        0                         0

Hence, we see that the column Item_Weight has 1463 missing values. Let's get more inferences from this
data.
> summary(train)

Here are some quick inferences drawn from variables in the train data set:
1. Item_Fat_Content has mismatched factor levels.
2. The minimum value of Item_Visibility is 0. Practically, this is not possible. If an item occupies shelf space in a
grocery store, it ought to have some visibility. We'll treat all 0s as missing values.
3. Item_Weight has 1463 missing values (already explained above).
4. Outlet_Size has an unmatched factor level (the blank level "").
These inferences will help us in treating these variables more accurately.
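
The checks behind inferences like these can be sketched on a few made-up rows. The data frame toy below only imitates the issues listed above. Its values and level spellings are invented, not copied from the real data set:

```r
# Made-up rows that imitate the issues listed above
toy <- data.frame(
  Item_Fat_Content = factor(c("LF", "low fat", "Low Fat", "Regular", "reg")),
  Item_Visibility  = c(0.016, 0, 0.019, 0, 0.05)
)

levels(toy$Item_Fat_Content)        # reveals inconsistent spellings
table(toy$Item_Fat_Content)         # how often each spelling occurs

zero_count <- sum(toy$Item_Visibility == 0)   # zeros to treat as missing
print(zero_count)

# Mark the zeros as NA so they are handled like other missing values
toy$Item_Visibility[toy$Item_Visibility == 0] <- NA
colSums(is.na(toy))
```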

Graphical Representation of Variables


I'm sure you would understand these variables better when explained visually. Using graphs, we can
analyze the data in 2 ways: Univariate Analysis and Bivariate Analysis.
Univariate analysis is done with one variable. Bivariate analysis is done with two variables. Univariate
analysis is a lot easier to do. Hence, I'll skip that part here. I'd recommend you to try it at your end. Let's
now experiment with bivariate analysis and carve out hidden insights.
For visualization, I'll use the ggplot2 package. These graphs would help us understand the distribution
and frequency of variables in the data set.


> library(ggplot2)
> ggplot(train, aes(x = Item_Visibility, y = Item_Outlet_Sales)) +
    geom_point(size = 2.5, color = "navy") +
    xlab("Item Visibility") + ylab("Item Outlet Sales") +
    ggtitle("Item Visibility vs Item Outlet Sales")


We can see that the majority of sales has been obtained from products having visibility less than 0.2. This
suggests that Item_Visibility < 0.2 must be an important factor in determining sales. Let's plot a few more
interesting graphs and explore such hidden stories.
# theme_bw() is added before theme(), because a complete theme added later resets the axis text settings
> ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) +
    geom_bar(stat = "identity", color = "purple") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) +
    ggtitle("Outlets vs Total Sales")


Here, we infer that OUT027 has contributed the majority of sales, followed by OUT035. OUT010 and OUT019
have probably the least footfall, thereby contributing the least outlet sales.
> ggplot(train, aes(Item_Type, Item_Outlet_Sales)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "navy")) +
    xlab("Item Type") + ylab("Item Outlet Sales") + ggtitle("Item Type vs Sales")


From this graph, we can infer that Fruits and Vegetables contribute the highest amount of outlet
sales, followed by snack foods and household products. This information can also be represented
using a box plot. The benefit of using a box plot is that you get to see the outliers and the spread
of the corresponding levels of a variable (shown below).
> ggplot(train, aes(Item_Type, Item_MRP)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "red")) +
    xlab("Item Type") + ylab("Item MRP") + ggtitle("Item Type vs Item MRP")


The black point you see is an outlier. The middle line you see in each box is the median value of each item
type. To know more about boxplots, check this tutorial
(https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/).
Now, we have an idea of the variables and their importance on the response variable. Let's now move
back to where we started: missing values. Now we'll impute the missing values.
We saw that the variable Item_Weight has missing values. Item_Weight is a continuous variable. Hence, in
this case we can impute missing values with the mean / median of Item_Weight. These are the most
commonly used methods of imputing missing values. To explore other imputation techniques,
check out this tutorial (https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/).
Let's first combine the data sets. This will save us time as we don't need to write separate code
for train and test data sets. To combine the two data frames, we must make sure that they have the same
columns, which is not the case.
> dim(train)
[1] 8523   12
> dim(test)
[1] 5681   11


The test data set has one less column (the response variable). Let's first add the column. We can give this
column any value. An intuitive approach would be to extract the mean value of sales from the train data
set and use it as a placeholder for the test variable Item_Outlet_Sales. Anyway, let's keep it simple for
now. I've taken a value of 1. Now, we'll combine the data sets.
> test$Item_Outlet_Sales <- 1
> combi <- rbind(train, test)

Impute missing values by median. I'm using the median because it is known to be highly robust to outliers.
Moreover, for this problem, our evaluation metric is RMSE
(https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-errormetrics/),
which is itself highly affected by outliers. Hence, median is better in this case.
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)
> table(is.na(combi$Item_Weight))

FALSE
14204
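
To see why the median is the safer fill value when outliers are present, here is a minimal sketch on invented numbers (the vector `weights` is hypothetical and not taken from the Big Mart data). A single extreme value drags the mean far away from the bulk of the data, while the median barely moves.

```r
# Hypothetical weights: two missing entries and one extreme value
weights <- c(8, 9, 10, NA, 11, 12, NA, 100)

mean_fill   <- mean(weights, na.rm = TRUE)    # pulled upwards by the value 100
median_fill <- median(weights, na.rm = TRUE)  # stays close to the typical values

# Same imputation pattern as used for Item_Weight
imputed <- weights
imputed[is.na(imputed)] <- median_fill

print(c(mean = mean_fill, median = median_fill))
print(imputed)
```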


Trouble with Continuous Variables & Categorical Variables


It's important to learn to deal with continuous and categorical variables separately in a data set. In
other words, they need special attention. In this data set, we have only 3 continuous variables and the rest
are categorical in nature. If you are still confused, I'd suggest you look at the data set once again
using str() and proceed.

Let's take up Item_Visibility. In the graph above, we saw that item visibility also takes the value zero, which is
practically not feasible. Hence, we'll consider it as a missing value and once again make the
imputation using median.
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0,
    median(combi$Item_Visibility), combi$Item_Visibility)
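
Note that the median in the command above is computed over all values, zeros included. One possible variation, which is not what this tutorial uses, is to take the median of the non-zero values only, so that the placeholder zeros do not pull the fill value down. The sketch below compares both on an invented vector `visibility`.

```r
# Hypothetical visibility values; zeros play the role of missing entries
visibility <- c(0.02, 0, 0.05, 0.10, 0, 0.03)

# As in the tutorial: median over all values, zeros included
median_all <- median(visibility)

# Variation (an assumption, not the tutorial's method): ignore zeros first
median_nonzero <- median(visibility[visibility != 0])

fixed <- ifelse(visibility == 0, median_nonzero, visibility)

print(c(all = median_all, nonzero = median_nonzero))
print(fixed)
```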

Let's proceed to categorical variables now. During exploration, we saw there are mismatched levels
in variables which need to be corrected.


> levels(combi$Outlet_Size)[1] <- "Other"
> library(plyr)
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content,
    c("LF" = "Low Fat", "reg" = "Regular"))
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c("low fat" = "Low Fat"))
> table(combi$Item_Fat_Content)

Low Fat Regular
   9185    5019

Using the commands above, I've assigned the name 'Other' to the unnamed level in the Outlet_Size variable.
For the rest, I've simply renamed the various levels of Item_Fat_Content.
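
The same clean-up can also be done without plyr. The sketch below is a base R alternative on made-up factors (`fat` and `size` are hypothetical): assigning a named list to `levels()` merges several old labels into one new label, and selecting the blank level by value avoids relying on it being the first level.

```r
# Hypothetical factor with inconsistent spellings
fat <- factor(c("LF", "Low Fat", "low fat", "reg", "Regular", "Low Fat"))

# Names of the list are the new levels, values are the old levels to merge
levels(fat) <- list("Low Fat" = c("LF", "low fat", "Low Fat"),
                    "Regular" = c("reg", "Regular"))
fat_table <- table(fat)
print(fat_table)

# Hypothetical factor with a blank level; rename it by value, not by position
size <- factor(c("Medium", "", "Small", ""))
levels(size)[levels(size) == ""] <- "Other"
print(table(size))
```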

4. Data Manipulation in R
Let's call it the advanced level of data exploration. In this section we'll practically learn about
feature engineering and other useful aspects.
Feature Engineering: This component separates an intelligent data scientist from a technically
enabled data scientist. You might have access to large machines to run heavy computations and
algorithms, but the power delivered by new features just can't be matched. We create new variables
to extract and provide as much new information to the model as possible, to help it make accurate predictions.
If you have been thinking all this time, great. But now is the time to think deeper. Look at the data set
and ask yourself, what else (factor) could influence Item_Outlet_Sales? Anyhow, the answer is below.
But, I want you to try it out first, before scrolling down.

1. Count of Outlet Identifiers: There are 10 unique outlets in this data. This variable will give us
information on the count of outlets in the data set. The more often an outlet appears, the more
sales it is likely to contribute.
> library(dplyr)
> a <- combi %>%
    group_by(Outlet_Identifier) %>%
    tally()


> head(a)
Source: local data frame [6 x 2]

  Outlet_Identifier     n
             (fctr) (int)
1            OUT010   925
2            OUT013  1553
3            OUT017  1543
4            OUT018  1546
5            OUT019   880
6            OUT027  1559
> names(a)[2] <- "Outlet_Count"
> combi <- full_join(a, combi, by = "Outlet_Identifier")

As you can see, the dplyr package makes data manipulation quite effortless. You no longer need to write
long functions. In the code above, I've simply stored the new data frame in a variable a. Later, the new
column Outlet_Count is added to our original combi data set. To know more about dplyr, follow this
tutorial (https://rpubs.com/bradleyboehmke/data_wrangling).
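
If you prefer not to build a separate table and join it back, a count feature can also be attached in base R. The following is a small sketch on invented rows (`toy` is hypothetical): `ave()` returns one value per row, so the original row order is untouched, and a `table()` lookup gives the same result.

```r
# Hypothetical rows; only the outlet identifier matters for the count
toy <- data.frame(
  Outlet_Identifier = c("OUT010", "OUT013", "OUT010", "OUT027", "OUT013", "OUT010"),
  Item_MRP          = c(250, 48, 142, 182, 54, 51)
)

# ave() applies length() within each outlet and aligns the result with the rows
toy$Outlet_Count <- ave(seq_len(nrow(toy)), toy$Outlet_Identifier, FUN = length)

# Equivalent lookup: count once with table(), then index by identifier
counts <- table(toy$Outlet_Identifier)
count_lookup <- as.integer(counts[toy$Outlet_Identifier])

print(toy)
```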

2. Count of Item Identifiers: Similarly, we can compute the count of item identifiers too. It's a good
practice to fetch more information from unique ID variables using their count. This will help us
understand which item has the maximum frequency.
> b <- combi %>%
    group_by(Item_Identifier) %>%
    tally()

> names(b)[2] <- "Item_Count"
> head(b)
  Item_Identifier Item_Count
           (fctr)      (int)
1           DRA12          9
2           DRA24         10
3           DRA59         10
4           DRB01          8
5           DRB13          9
6           DRB24          8

> combi <- merge(b, combi, by = "Item_Identifier")


3. Outlet Years: This variable represents how long a particular outlet has existed as of year
2013. Why just 2013? You'll find the answer in the problem statement here
(http://datahack.analyticsvidhya.com/contest/practice-problem-bigmart-sales-prediction). My
hypothesis is: the older the outlet, the more footfall, the larger the base of loyal customers and the larger the outlet sales.
> c <- combi %>%
    select(Outlet_Establishment_Year) %>%
    mutate(Outlet_Year = 2013 - Outlet_Establishment_Year)
> head(c)
  Outlet_Establishment_Year Outlet_Year
1                      1999          14
2                      2009           4
3                      1999          14
4                      1998          15
5                      1987          26
6                      2009           4
# Outlet_Establishment_Year is not a unique key, so a join on it would duplicate rows.
# c has the same row order as combi, so the new column can be assigned directly.
> combi$Outlet_Year <- c$Outlet_Year

This suggests that outlets established in 1999 were 14 years old in 2013 and so on.
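
A word of caution when attaching a derived column with a join: the join key should be unique in at least one of the two tables, otherwise rows multiply. The sketch below shows the effect on three invented rows (`sales` and `ages` are hypothetical) using base R `merge()`, and contrasts it with direct assignment, which keeps exactly one row per original row.

```r
# Hypothetical data: the year 1999 appears twice, so `year` is not a unique key
sales <- data.frame(year = c(1999, 2009, 1999), sale = c(10, 20, 30))

# Derived table built row by row, with the same repeated years
ages <- data.frame(year = sales$year, age = 2013 - sales$year)

# Many-to-many join: every 1999 row in `ages` matches every 1999 row in `sales`
joined   <- merge(ages, sales, by = "year")
n_joined <- nrow(joined)
print(n_joined)

# Direct assignment adds the column without changing the number of rows
sales$age <- 2013 - sales$year
print(sales)
```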

4. Item Type New: Now, pay attention to the Item_Identifiers. We are about to discover a new
trend. Look carefully, there is a pattern in the identifiers starting with "FD", "DR", "NC". Now, check the
corresponding Item_Types of these identifiers in the data set. You'll discover that items corresponding to
"FD" are mostly eatables. Items corresponding to "DR" are drinks. And items corresponding to "NC"
are products which can't be consumed; let's call them non-consumable. Let's extract these prefixes
into a new variable representing these categories.
Here I'll use the substr() and gsub() functions to extract and rename the values respectively.
> q <- substr(combi$Item_Identifier, 1, 2)
> q <- gsub("FD", "Food", q)
> q <- gsub("DR", "Drinks", q)
> q <- gsub("NC", "NonConsumable", q)
> table(q)
       Drinks          Food NonConsumable
         1317         10201          2686

Let's now add this information to our data set with a variable name Item_Type_New.
> combi$Item_Type_New <- q
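
As an alternative to chaining several gsub() calls, the two-letter prefix can be mapped through a named lookup vector. This is only a sketch on invented identifiers (`ids` is hypothetical); the labels are the same ones used above.

```r
# Hypothetical item identifiers following the FD / DR / NC pattern
ids <- c("FDA15", "DRC01", "NCD19", "FDX07")

# First two characters carry the category
prefix <- substr(ids, 1, 2)

# Named vector acts as a dictionary from prefix to label
lookup <- c(FD = "Food", DR = "Drinks", NC = "NonConsumable")
item_type_new <- unname(lookup[prefix])

print(table(item_type_new))
```

A prefix that is not in `lookup` comes back as NA, which makes unexpected identifiers easy to spot.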

I'll leave the rest of the feature engineering intuition to you. You can think of more variables which could
add more information to the model. But make sure the variables aren't correlated. Since they are
derived from the same set of variables, there is a high chance for them to be correlated. You can
check this in R using the cor() function.

Label Encoding and One Hot Encoding


Just one last aspect of feature engineering is left: Label Encoding and One Hot Encoding.

Label Encoding, in simple words, is the practice of numerically encoding (replacing) different levels of
a categorical variable. For example: in our data set, the variable Item_Fat_Content has 2 levels: Low
Fat and Regular. So, we'll encode Low Fat as 0 and Regular as 1. This will help us convert a factor
variable into a numeric variable. This can be simply done using the ifelse() function in R.
> combi$Item_Fat_Content <- ifelse(combi$Item_Fat_Content == "Regular", 1, 0)


One Hot Encoding, in simple words, is the splitting of a categorical variable into its unique levels,
and eventually removing the original variable from the data set. Confused? Here's an example: let's take
any categorical variable, say, Outlet_Location_Type. It has 3 levels. One hot encoding of this variable
will create 3 different variables consisting of 1s and 0s. 1 will represent the presence of a level and
0 will represent its absence. Let's look at a sample:
> sample <- select(combi, Outlet_Location_Type)
> demo_sample <- data.frame(model.matrix(~ . - 1, sample))
> head(demo_sample)
  Outlet_Location_TypeTier.1 Outlet_Location_TypeTier.2 Outlet_Location_TypeTier.3
1                          1                          0                          0
2                          0                          0                          1
3                          1                          0                          0
4                          0                          0                          1
5                          0                          0                          1
6                          0                          0                          1

model.matrix creates a matrix of encoded variables. ~ . - 1 tells R to encode all variables in the data
frame, but suppress the intercept. So, what will happen if you don't write -1? model.matrix will skip
the first level of the factor, thereby resulting in columns for just 2 out of 3 factor levels. The first level
is then only implied, as the rows where both columns are 0.
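
The difference between the two formulas is easy to check on a tiny example. The sketch below uses a made-up data frame `tier` with a single factor column, builds the matrix with and without the intercept, and also shows label encoding of a two-level factor. All data here is invented for illustration.

```r
# Hypothetical factor column with three levels
tier <- data.frame(
  Outlet_Location_Type = factor(c("Tier 1", "Tier 3", "Tier 2", "Tier 3"))
)

# Without intercept: one 0/1 column per level
one_hot <- model.matrix(~ . - 1, data = tier)

# With intercept: the first level is dropped and acts as the baseline
reference <- model.matrix(~ ., data = tier)

print(colnames(one_hot))
print(colnames(reference))

# Label encoding of a two-level factor: Regular -> 1, Low Fat -> 0
fat <- factor(c("Low Fat", "Regular", "Regular"))
fat_encoded <- as.integer(fat == "Regular")
print(fat_encoded)
```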

This was the demonstration of one hot encoding. Hope you have understood the concept now. Let's
now apply this technique to all categorical variables in our data set (excluding ID variables).
> library(dummies)
> combi <- dummy.data.frame(combi, names =
    c('Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_New'), sep = '_')

With this, I have shared 2 different methods of performing one hot encoding in R. Let's check if
encoding has been done.
> str(combi)
$ Outlet_Size_Other            : int 0 1 1 0 1 0 0 0 0 0 ...
$ Outlet_Size_High             : int 0 0 0 1 0 0 0 0 0 0 ...
$ Outlet_Size_Medium           : int 1 0 0 0 0 0 1 1 0 1 ...
$ Outlet_Size_Small            : int 0 0 0 0 0 1 0 0 1 0 ...
$ Outlet_Location_Type_Tier 1  : int 1 0 0 0 0 0 0 0 1 0 ...
$ Outlet_Location_Type_Tier 2  : int 0 1 0 0 1 1 0 0 0 0 ...
$ Outlet_Location_Type_Tier 3  : int 0 0 1 1 0 0 1 1 0 1 ...
$ Outlet_Type_Grocery Store    : int 0 0 1 0 0 0 0 0 0 0 ...
$ Outlet_Type_Supermarket Type1: int 1 1 0 1 1 1 0 0 1 0 ...
$ Outlet_Type_Supermarket Type2: int 0 0 0 0 0 0 0 1 0 0 ...
$ Outlet_Type_Supermarket Type3: int 0 0 0 0 0 0 1 0 0 1 ...
$ Item_Outlet_Sales            : num 1382928425532553...
$ Year                         : num 14 11 15 26 6 9 28 4 16 28 ...
$ Item_Type_New_Drinks         : int 1 1 1 1 1 1 1 1 1 1 ...
$ Item_Type_New_Food           : int 0 0 0 0 0 0 0 0 0 0 ...
$ Item_Type_New_NonConsumable  : int 0 0 0 0 0 0 0 0 0 0 ...

As you can see, after one hot encoding, the original variables are removed automatically from the
data set.

5. Predictive Modeling using Machine Learning


Finally, we'll drop the columns which have either been converted using other variables or are
identifier variables. This can be accomplished using select from the dplyr package.
> combi <- select(combi, -c(Item_Identifier, Outlet_Identifier, Item_Fat_Content,
    Outlet_Establishment_Year, Item_Type))
> str(combi)

In this section, I'll cover Regression, Decision Trees and Random Forest. A detailed explanation of
these algorithms is outside the scope of this article. These algorithms have been satisfactorily
explained in our previous articles. I've provided the links to useful resources.
As you can see, we have encoded all our categorical variables. Now, this data set is good to
take forward to modeling. Since we started from Train and Test, let's now divide the data sets.
> new_train <- combi[1:nrow(train), ]
> new_test <- combi[-(1:nrow(train)), ]


Linear (Multiple) Regression
Multiple Regression is used when the response variable is continuous in nature and predictors are many.
Had it been categorical, we would have used Logistic Regression. Before you proceed, sharpen
your basics of Regression here (https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/).

Linear Regression makes the following assumptions:


1. There exists a linear relationship between response and predictor variables
2. The predictor (independent) variables are not correlated with each other. Presence of collinearity leads
to a phenomenon known as multicollinearity (https://en.wikipedia.org/wiki/Multicollinearity).
3. The error terms are uncorrelated. Otherwise, it will lead to autocorrelation
(https://en.wikipedia.org/wiki/Autocorrelation#Regression_analysis).
4. Error terms must have constant variance. Non-constant variance leads to heteroskedasticity
(https://en.wikipedia.org/wiki/Heteroscedasticity).

Let's now build our first regression model on this data set. R uses the lm() function for regression.
> linear_model <- lm(Item_Outlet_Sales ~ ., data = new_train)
> summary(linear_model)

Adjusted R² measures the goodness of fit of a regression model. The higher the R², the better the model.
Our R² = 0.2085. It means we really did something drastically wrong. Let's figure it out.
In our case, I could find that our new variables aren't helping much, i.e. Item_Count, Outlet_Count and
Item_Type_New. None of these variables is significant. Significant variables are denoted by a * sign.

As we know, correlated predictor variables bring down the model accuracy. Let's find out the
amount of correlation present in our predictor variables. This can be simply calculated using:
> cor(new_train)

Alternatively, you can also use the corrplot package for some fancy correlation plots. Scrolling through
the long list of correlation coefficients, I could find a deadly correlation coefficient:
> cor(new_train$Outlet_Count, new_train$`Outlet_Type_Grocery Store`)
[1] -0.9991203

Outlet_Count is highly correlated (negatively) with Outlet Type Grocery Store. Here are some
problems I could find in this model:
1. We have correlated predictor variables.
2. We did one hot encoding and label encoding. That's not necessary since linear regression handles
categorical variables by creating dummy variables intrinsically.
3. The new variables (item count, outlet count, item type new) created in feature engineering are not
significant.
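
The first problem is easy to reproduce in miniature. In the sketch below, all outlet names, types and row counts are invented: two small outlets are grocery stores and two large ones are not, so a count feature carries almost the same information as the grocery store dummy and the two end up perfectly negatively correlated.

```r
# Hypothetical outlets: the grocery stores have few rows, the others have many
outlet <- c(rep("OUT010", 2), rep("OUT013", 5), rep("OUT019", 2), rep("OUT027", 5))
type   <- ifelse(outlet %in% c("OUT010", "OUT019"), "Grocery Store", "Supermarket")

# Count feature and one hot dummy, built the same way as in feature engineering
outlet_count <- ave(seq_along(outlet), outlet, FUN = length)
is_grocery   <- as.integer(type == "Grocery Store")

# Correlation between the engineered count and the dummy variable
r <- cor(outlet_count, is_grocery)
print(r)
```

In a case like this, one of the two variables can be dropped without losing information.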

Let's try to create a more robust regression model. This time, I'll be building a simple model
without encoding and new features. Below is the entire code:
# load directory
> path <- "C:/Users/manish/desktop/Data/February2016"
> setwd(path)

# load data (stringsAsFactors = TRUE keeps text columns as factors, the default before R 4.0)
> train <- read.csv("train_Big.csv", stringsAsFactors = TRUE)
> test <- read.csv("test_Big.csv", stringsAsFactors = TRUE)

# create a new variable in test file
> test$Item_Outlet_Sales <- 1

# combine train and test data
> combi <- rbind(train, test)

# impute missing value in Item_Weight
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)

# impute 0 in Item_Visibility
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility),
    combi$Item_Visibility)

# rename level in Outlet_Size
> levels(combi$Outlet_Size)[1] <- "Other"

# rename levels of Item_Fat_Content
> library(plyr)
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c("LF" = "Low Fat", "reg" =
    "Regular"))
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c("low fat" = "Low Fat"))

# create a new column 2013 - Year
> combi$Year <- 2013 - combi$Outlet_Establishment_Year

# drop variables not required in modeling
> library(dplyr)
> combi <- select(combi, -c(Item_Identifier, Outlet_Identifier, Outlet_Establishment_Year))

# divide data set
> new_train <- combi[1:nrow(train), ]
> new_test <- combi[-(1:nrow(train)), ]

# linear regression
> linear_model <- lm(Item_Outlet_Sales ~ ., data = new_train)
> summary(linear_model)

Now we have got R² = 0.5623. This teaches us that sometimes all you need is a simple thought process
to get high accuracy. Quite a good improvement from the previous model. Next time you work on
any model, always remember to start with a simple model.

Let's check out the regression plots to find more ways to improve this model.
> par(mfrow = c(2, 2))
> plot(linear_model)


You can zoom these graphs in R Studio at your end. All these plots have a different story to tell. But
the most important story is being portrayed by the Residuals vs Fitted graph.

Residual values are the difference between actual and predicted outcome values. Fitted values are
the predicted values. If you look carefully, you'll discover a funnel shaped graph (from right to left).
The shape of this graph suggests that our model is suffering from heteroskedasticity (unequal
variance in error terms). Had there been constant variance, there would be no pattern visible in this
graph.
A common practice to tackle heteroskedasticity is taking the log of the response variable. Let's do it
and check if we can get further improvement.
> linear_model <- lm(log(Item_Outlet_Sales) ~ ., data = new_train)
> summary(linear_model)


And, here's a snapshot of my model output. Congrats! We have got an improved model with R² = 0.72.
Now, we are on the right path. Once again you can check the residual plots (you might zoom in). You'll
find there is no longer a trend in the residual vs fitted value plot.
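
The effect of the log transformation can also be checked numerically on simulated data. The sketch below is an illustration under stated assumptions: the data is generated with multiplicative noise (so the spread grows with the response), and `spread_ratio()` is a helper invented here that compares the residual spread in the upper half of fitted values with the lower half. A ratio well above 1 signals a funnel; a ratio near 1 signals roughly constant variance.

```r
set.seed(1)

# Simulated data with multiplicative noise: error size grows with y
x <- runif(500, 1, 10)
y <- exp(0.3 * x + rnorm(500, sd = 0.4))

fit_raw <- lm(y ~ x)        # response on the original scale
fit_log <- lm(log(y) ~ x)   # response on the log scale

# Hypothetical helper: residual sd for high fitted values / sd for low ones
spread_ratio <- function(fit) {
  f  <- fitted(fit)
  r  <- resid(fit)
  hi <- f > median(f)
  sd(r[hi]) / sd(r[!hi])
}

ratio_raw <- spread_ratio(fit_raw)
ratio_log <- spread_ratio(fit_log)
print(c(raw = ratio_raw, log = ratio_log))
```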


This model can be further improved by detecting outliers and high leverage points. For now, I leave
that part to you! I shall write a separate post on mysteries of regression soon. For now, let's check our
RMSE so that we can compare it with other algorithms demonstrated below.


To calculate RMSE, we can load a package named Metrics.

> install.packages("Metrics")
> library(Metrics)
> rmse(new_train$Item_Outlet_Sales, exp(linear_model$fitted.values))
[1] 1140.004
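
RMSE is simple enough to compute by hand, which helps in understanding what the number means. The sketch below defines a small function `rmse_manual()` (a name made up for this illustration) and applies it to invented values. It also shows why exp() is needed in the command above: a model fitted on log(sales) predicts on the log scale, so predictions must be transformed back before comparing with actual sales.

```r
# Root mean squared error: square the errors, average them, take the root
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical actual and predicted sales
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
score <- rmse_manual(actual, predicted)
print(score)

# Predictions on the log scale must be back-transformed with exp() first
log_predicted   <- log(predicted)
score_from_log  <- rmse_manual(actual, exp(log_predicted))
print(score_from_log)
```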

Let's proceed to the decision tree algorithm and try to improve our RMSE score.

Decision Trees
Before you start, I'd recommend you to glance through the basics of decision tree algorithms. To
understand what makes it superior to linear regression, check this tutorial Part 1
(https://www.analyticsvidhya.com/blog/2015/01/decision-tree-simplified/) and Part 2
(https://www.analyticsvidhya.com/blog/2015/01/decision-tree-algorithms-simplified/).
In R, the decision tree algorithm can be implemented using the rpart package. In addition, we'll use the caret
package for doing cross validation. Cross validation is a technique to build robust models which are
not prone to overfitting. Read more about Cross Validation
(https://www.analyticsvidhya.com/blog/2015/11/improve-model-performance-cross-validation-inpython-r/).
In R, decision tree uses a complexity parameter (cp). It measures the tradeoff between model
complexity and accuracy on the training set. A smaller cp will lead to a bigger tree, which might overfit
the model. Conversely, a large cp value might underfit the model. Underfitting occurs when the
model does not capture underlying trends properly. Let's find out the optimum cp value for our
model with 5 fold cross validation.
# loading required libraries
> library(rpart)
> library(e1071)
> library(rpart.plot)
> library(caret)


# setting the tree control parameters
> fitControl <- trainControl(method = "cv", number = 5)
> cartGrid <- expand.grid(.cp = (1:50) * 0.01)

# decision tree
> tree_model <- train(Item_Outlet_Sales ~ ., data = new_train, method = "rpart", trControl =
    fitControl, tuneGrid = cartGrid)
> print(tree_model)

The final value for cp = 0.01. You can also check the table populated in the console for more information.
The model with cp = 0.01 has the least RMSE. Let's now build a decision tree with 0.01 as the complexity
parameter.
> main_tree <- rpart(Item_Outlet_Sales ~ ., data = new_train, control =
    rpart.control(cp = 0.01))
> prp(main_tree)

Here is the tree structure of our model. If you have gone through the basics, you would now
understand that this algorithm has marked Item_MRP as the most important variable (being the root
node). Let's check the RMSE of this model and see if this is any better than regression.


> pre_score <- predict(main_tree, type = "vector")
> rmse(new_train$Item_Outlet_Sales, pre_score)
[1] 1102.774

As you can see, our RMSE has further improved from 1140 to 1102.77 with decision tree. To improve
this score further, you can tune the parameters for greater accuracy.
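
To get a feel for what cp and cross validation do in this section, here is a minimal sketch that needs only the rpart package. Everything in it is an assumption made for illustration: the data is simulated, `leaves()` and `cv_rmse()` are helper names invented here, and the 5 fold cross validation is written out by hand instead of going through caret. It first shows that a smaller cp gives a bigger tree, and then picks the cp with the lowest cross validated RMSE from a small grid.

```r
library(rpart)
set.seed(42)

# Simulated data: a big step in x1 plus a weaker linear effect of x2
n   <- 300
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 10 * (toy$x1 > 0.5) + 3 * toy$x2 + rnorm(n, sd = 0.5)

# Number of terminal nodes for a given cp: smaller cp allows more splits
leaves <- function(cp) {
  fit <- rpart(y ~ ., data = toy, control = rpart.control(cp = cp))
  sum(fit$frame$var == "<leaf>")
}
leaves_small_cp <- leaves(0.001)
leaves_large_cp <- leaves(0.2)
print(c(small_cp = leaves_small_cp, large_cp = leaves_large_cp))

# Manual 5 fold cross validation: each row is held out exactly once
folds <- sample(rep(1:5, length.out = n))
cv_rmse <- function(cp) {
  errs <- sapply(1:5, function(k) {
    fit  <- rpart(y ~ ., data = toy[folds != k, ],
                  control = rpart.control(cp = cp))
    pred <- predict(fit, newdata = toy[folds == k, ])
    sqrt(mean((toy$y[folds == k] - pred)^2))
  })
  mean(errs)
}

# Evaluate a small grid of cp values and keep the best one
cp_grid <- c(0.001, 0.01, 0.1, 0.5)
scores  <- sapply(cp_grid, cv_rmse)
best_cp <- cp_grid[which.min(scores)]
print(data.frame(cp = cp_grid, rmse = scores))
print(best_cp)
```

Because the held out fold is never used for fitting, the RMSE computed this way is a fairer estimate than an RMSE measured on the training rows themselves.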

Random Forest
Random Forest is a powerful algorithm which holistically takes care of missing values, outliers and
other non-linearities in the data set. It's simply a collection of decision trees, hence the name
'forest'. I'd suggest you quickly refresh your basics of random forest with this tutorial
(https://www.analyticsvidhya.com/blog/2015/09/random-forest-algorithm-multiple-challenges/).
In R, the random forest algorithm can be implemented using the randomForest package. Again, we'll use the caret
package for cross validation and finding the optimum value of model parameters.
For this problem, I'll focus on two parameters of random forest: mtry and ntree. ntree is the number
of trees to be grown in the forest. mtry is the number of variables taken at each node to build a tree.
And, we'll do a 5 fold cross validation.
Let's do it!
# load randomForest library
> library(randomForest)

# set tuning parameters
> control <- trainControl(method = "cv", number = 5)

# random forest model
> rf_model <- train(Item_Outlet_Sales ~ ., data = new_train, method = "parRF", trControl =
    control, prox = TRUE, allowParallel = TRUE)

# check optimal parameters
> print(rf_model)


If you notice, you'll see I've used method = "parRF". This is a parallel
implementation of random forest. It causes your local machine to take less time in
random forest computation. Alternatively, you can also use method = "rf" as a standard random forest
function.
Now we've got the optimal value of mtry = 15. Let's use 1000 trees for computation.
# random forest model
> forest_model <- randomForest(Item_Outlet_Sales ~ ., data = new_train, mtry = 15, ntree =
    1000)
> print(forest_model)
> varImpPlot(forest_model)

This model throws RMSE = 1132.04, which is not an improvement over the decision tree model. Random
forest has a feature of presenting the important variables. We see that the most important variable is
Item_MRP (also shown by decision tree algorithm).


This model can be further improved by tuning parameters. Also, let's make our first submission with
our best RMSE score by decision tree.
> main_predict <- predict(main_tree, newdata = new_test, type = "vector")
> sub_file <- data.frame(Item_Identifier = test$Item_Identifier, Outlet_Identifier =
    test$Outlet_Identifier, Item_Outlet_Sales = main_predict)
> write.csv(sub_file, 'Decision_tree_sales.csv')

When predicted on out of sample data, our RMSE has come out to be 1174.33. Here are some things
you can do to improve this model further:
1. Since we did not use encoding, I encourage you to use one hot encoding and label encoding for
the random forest model.
2. Parameter tuning will help.
3. Use Gradient Boosting (https://www.analyticsvidhya.com/blog/2015/09/complete-guide-boostingmethods/).
4. Build an ensemble of these models. Read more about Ensemble Modeling
(https://www.analyticsvidhya.com/blog/2015/09/questions-ensemble-modeling/).

Do implement the ideas suggested above and share your improvement in the comments section
below. Currently, Rank 1 on the Leaderboard (http://datahack.analyticsvidhya.com/contest/practiceproblem-big-mart-sales-iii/lb) has obtained an RMSE score of 1137.71. Beat it!

End Notes
This brings us to the end of this tutorial. Regrets for the not so happy ending. But, I've given you enough
hints to work on. The decision to not use encoded variables in the model turned out to be beneficial
until decision trees.
The motive of this tutorial was to get you started with predictive modeling in R. We learnt a few
uncanny things, such as building simple models first. Don't jump towards building a complex model. Simple
models give you a benchmark score and a threshold to work with.
In this tutorial, I have demonstrated the steps used in predictive modeling in R. I've covered data
exploration, data visualization, data manipulation and building models using Regression, Decision
Trees and Random Forest algorithms.
Did you find this tutorial useful? Are you facing any trouble at any stage of this tutorial? Feel free to
mention your doubts in the comments section below. Do share if you get a better score.
Edit: On visitors' request, the PDF version of the tutorial is available for download. You need to create
a login account to download the PDF. Also, you can bookmark this page for future reference.
Download Here (http://discuss.analyticsvidhya.com/t/download-free-tutorial-to-learn-data-science-in-r-from-scratch/7797/2).

You want to apply your analytical skills and test your potential?
Then participate in our Hackathons
(http://datahack.analyticsvidhya.com/contest/all) and compete with Top Data
Scientists from all over the world.




Author

Analytics Vidhya Content Team



105 COMMENTS

Steve (http://www.bigewisdom.net/) says:
FEBRUARY 29, 2016 AT 3:46 AM

Thanks for sharing! Can this content be made available in PDF format?
Thanks,

Analytics Vidhya Content Team says:
FEBRUARY 29, 2016 AT 6:13 AM

Welcome Steve. I can make that available. I'll email it to you shortly.

Abhijit says:
FEBRUARY 29, 2016 AT 7:45 AM

Please make it (PDF version) available for all the users as well. It will help a lot in a nutshell.

Hemant says:
FEBRUARY 29, 2016 AT 11:07 AM

Manish, nice content for beginners. Thanks! I also want this content in PDF format. Please mail it
to me as well.

Analytics Vidhya Content Team says:
FEBRUARY 29, 2016 AT 3:19 PM

Hi Hemant
PDF is available for download. Link is added in the tutorial at the end.

Hemant says:
MARCH 6, 2016 AT 1:48 PM

Manish, that link is not working:
http://discuss.analyticsvidhya.com/t/download-free-tutorial-to-learn-data-science-in-r-from-scratch/7797
Please see it.

Analytics Vidhya Content Team says:
MARCH 8, 2016 AT 5:50 AM

Hi Hemant
The link is working fine. You need to create a one-time user login to download the PDF.

Ajit Yadav says:
JUNE 20, 2016 AT 1:59 AM

Hi Manish,
We are looking for R language experts with a good understanding of Data Science. We require an
expert to write a book on Data Science using the R language. Interested writers/experts, please
contact us with your latest profile at alpinessolutions at gmail dot com.

midhun1992 says:
MARCH 10, 2016 AT 11:45 AM

Sir, I couldn't find the datasets mentioned in the article. Can you please guide me to where I can get
the data sets? Thanks.

Elan says:
MAY 19, 2016 AT 8:59 AM

Please advise how to download the data set.
I couldn't find the link after having logged in to your site.

Analytics Vidhya Content Team says:
JUNE 10, 2016 AT 11:07 PM

Hi Elan
Please download the data from here: http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii

bgreddy says:
SEPTEMBER 16, 2016 AT 6:18 PM

Please mail the PDF to [email protected]

Dr.D.K.Samuel says:
FEBRUARY 29, 2016 AT 4:09 AM

Nice writeup, useful, thanks. Samuel

Analytics Vidhya Content Team says:
FEBRUARY 29, 2016 AT 6:13 AM

Welcome Samuel!

Himanshu Dhingra (http://www.gutargoo.com) says:
FEBRUARY 29, 2016 AT 6:35 AM

Thanks Manish. You wrote an amazing article for beginners. I was looking for an article like this,
which clears up the basics of R without referring to any books.
I also request you to send me the doc or PDF of this so that I can print it and keep it handy to
read.

Analytics Vidhya Content Team says:
FEBRUARY 29, 2016 AT 3:21 PM

Thanks Himanshu! PDF is available for download. Link is added at the end of the tutorial.

Krishna says:
FEBRUARY 29, 2016 AT 8:25 AM

Good one. Please mail me a PDF as well.

Devendra Yadav says:
FEBRUARY 29, 2016 AT 9:26 AM

Hi Manish
Could you please share the PDF with me as well? I am a starter in R and this can help as a compact
guide for myself when trying out different things.
Thanks

Rad Mou says:
FEBRUARY 29, 2016 AT 9:33 AM

Hello, when I type log(12) I get 2.484907 as a result. What seems to be the problem?

Ram says:
FEBRUARY 29, 2016 AT 4:38 PM

@RadMou,
It seems that there is a typo in the article. The fact is: log uses base e; log10 uses base 10 and
log2 uses base 2.
You can see that these commands print different values:
log(12) # log to the base e
log10(12) # log to the base 10
log2(12) # log to the base 2
Hope this helps.
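
As a small illustration of the point made in this reply, the sketch below uses the `base` argument of R's `log()` and checks the change-of-base identity. The variable names are chosen only for this example.

```r
x <- 12

natural <- log(x)             # natural logarithm (base e), as in log(12)
base10  <- log(x, base = 10)  # same value as log10(x)
base2   <- log(x, base = 2)   # same value as log2(x)

# Change of base: log_b(x) = log(x) / log(b)
stopifnot(isTRUE(all.equal(base10, log(x) / log(10))))
stopifnot(isTRUE(all.equal(base2,  log(x) / log(2))))

print(c(natural = natural, base10 = base10, base2 = base2))
```

So the 2.484907 reported above is the natural logarithm of 12, which is what `log()` returns when no base is given.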

Zamin Sherazi says:
FEBRUARY 29, 2016 AT 9:58 AM

Thanks Manish. Would be grateful if it can be made available in PDF.

Analytics Vidhya Content Team says:
FEBRUARY 29, 2016 AT 3:19 PM

Hi Zamin
PDF is available for download.

Monil Doshi says:
FEBRUARY 29, 2016 AT 10:23 AM

Hi Manish,
This is very helpful for beginners like me.
Looking forward to more.
Is there any way I can get this in PDF format?
It would be really helpful.
My email id is [email protected].
Thank you very much!

Aanish says:
FEBRUARY 29, 2016 AT 11:19 AM

Thanks Manish. This is a great help! I have a question: I noticed that R automatically takes care
of the factor variables (by converting them to n or n-1 dummy variables) while performing linear
regression. Do you recommend that we do it explicitly?

Analytics Vidhya Content Team says:
FEBRUARY 29, 2016 AT 3:01 PM

Hi Anish
In case of linear regression, decision trees, random forest and kNN, it is not necessary to convert
categorical variables explicitly, as these algorithms intrinsically break a categorical variable into
n - 1 levels. However, if you are using boosting algorithms (GBM, XGBoost), it is recommended to
encode categorical variables prior to modeling. On a similar note, if you have followed this tutorial,
you'll find that I started with one hot encoding and got a terrible regression accuracy. Later, I used
the categorical variables as they are, and accuracy improved.
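
For the linear regression case, the automatic handling discussed here can be inspected directly. The sketch below is a minimal illustration on a made-up data frame (`toy`, with hypothetical columns `sales` and `size`), assuming R's default treatment contrasts. It covers only `lm()`; it says nothing about how tree-based packages treat factors internally.

```r
# Toy data, made up for illustration (not the Big Mart data).
toy <- data.frame(
  sales = c(10, 12, 9, 15, 14, 11),
  size  = factor(c("Small", "Medium", "High", "Small", "Medium", "High"))
)

# model.matrix() shows the design matrix that lm() builds from a formula:
# an intercept plus n - 1 dummy columns for a factor with n levels.
X <- model.matrix(sales ~ size, data = toy)
print(colnames(X))

# Fitting on the raw factor therefore needs no manual encoding.
fit <- lm(sales ~ size, data = toy)
print(names(coef(fit)))
```

The first level in alphabetical order ("High") becomes the reference level and is absorbed into the intercept, which is why only two dummy columns appear for three levels.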

kishor says:
FEBRUARY 29, 2016 AT 11:55 AM

Good presentation. Can you please provide it in PDF format?

chandrakala says:
FEBRUARY 29, 2016 AT 12:15 PM

Very helpful for beginners, thanks a lot! Keep it up.

Analytics Vidhya Content Team says:
FEBRUARY 29, 2016 AT 3:17 PM

Welcome!

Raman says:
FEBRUARY 29, 2016 AT 1:26 PM

Manish,
Very valuable tutorial. TY. If it is not too much trouble, can you please add a PDF version as a
link on the tutorial? Thanks.
Regards
Raman

Analytics Vidhya Content Team says:
FEBRUARY 29, 2016 AT 3:17 PM

Hi Raman
I've added the PDF link at the end of this tutorial.

Atul Khairnar says:
FEBRUARY 29, 2016 AT 1:53 PM

Thanks for sharing this article. This is really helpful to us. When I ran this script in RStudio, I got two
errors. For ggplot, I tried install.packages("ggplot2") and
install.packages("ggplot2", dependencies = TRUE), and I got the following error:
> ggplot(train, aes(x = Item_Visibility, y = Item_Outlet_Sales)) + geom_point(size = 2.5, color = "navy") +
xlab("Item Visibility") + ylab("Item Outlet Sales")
Error: could not find function "ggplot"
And also for merging data:
> combi <- merge(b, combi, by = "Outlet_Identifier")
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
Can you help me understand why this happens?
Once again, 'Thank You So Much', because I learnt new things about R.
Thanks,
Atul

Analytics Vidhya Content Team says:
FEBRUARY 29, 2016 AT 2:42 PM

Hi Atul
After installing the ggplot2 package, you should call the package in the next step using
library(ggplot2).
Then run the ggplot code; it should work.

merge function is used from package plyr. Have you installed it? Let me know.

Atul Khairnar says:
MARCH 1, 2016 AT 6:39 AM

Thanks Manish, I tried manually as well as through syntax, but it is still showing the following error:
install.packages("plyr")
library(plyr)
combi library(plyr)
Warning message:
package 'plyr' was built under R version 3.1.3
> combi <- merge(b, combi, by = "Outlet_Identifier") ########## Error showing ####
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
Can you please help me with this? Why is this error showing?

Arfath says:
MARCH 9, 2016 AT 8:01 AM

It's not combi library(plyr); it's only library(plyr).
One more thing I want to correct here is in
combi <- merge(b, combi, by = "Outlet_Identifier")
It's not Outlet_Identifier but Item_Identifier,
so the correct code is:
combi <- merge(b, combi, by = "Item_Identifier")
Hope this helps you out.
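
The error discussed in this thread can be reproduced on toy data. The sketch below uses `merge()`, which ships with base R, on two small made-up data frames that only borrow the names `b` and `combi` from the tutorial. It shows that the `by` column must exist in both inputs, and one way to check for shared columns before merging.

```r
# Toy stand-ins for the tutorial's objects (made-up values).
b <- data.frame(Item_Identifier = c("DRA12", "DRA24"),
                Item_Count = c(9L, 10L))
combi <- data.frame(Item_Identifier = c("DRA12", "DRA24", "DRA12"),
                    Outlet_Identifier = c("OUT027", "OUT035", "OUT035"))

# 'b' has no Outlet_Identifier column, so merge() stops with an error;
# tryCatch() captures the message instead of aborting the script.
bad <- tryCatch(merge(b, combi, by = "Outlet_Identifier"),
                error = function(e) conditionMessage(e))
print(bad)

# Check which columns the two data frames share before merging.
shared <- intersect(names(b), names(combi))
print(shared)

merged <- merge(b, combi, by = shared)
print(merged)
```

Running `head(b)` and `names(b)` before the merge, as several readers did here, is the quickest way to see which key the summary table actually carries.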

shashi says:
FEBRUARY 29, 2016 AT 2:48 PM

Can you share any material on data science?

mouradelghissassi1992 says:
FEBRUARY 29, 2016 AT 3:37 PM

Erratum: I'm not sure if the problem is from my computer, but:
When I execute head(b) I get:
DRA12 9
DRA24 10
And not
OUT027 2215.876
OUT035 1463.705
So the command
combi <- merge(b, combi, by = "Outlet_Identifier") should be
combi <- merge(b, combi, by = "Item_Identifier") instead.
Also, in head(c) there is a problem with the years: all rows are for 1985.

mouradelghissassi1992 says:
FEBRUARY 29, 2016 AT 4:40 PM

"Hence, we see that column Item_Visibility has 1463 missing values. Let's get more inferences
from this data." It's the Item_Weight variable that has missing values.
Also, in Label Encoding and One Hot Encoding: "the variable Item_Visibility has 2 levels: Low Fat
and Regular".
It's Item_Fat_Content, not Item_Visibility.

Analytics Vidhya Content Team says:
MARCH 1, 2016 AT 4:02 AM

Hi
Thank you so much! Editing error. Rectified now.

Analytics Vidhya Content Team says:
MARCH 1, 2016 AT 4:22 AM

Hi
Thanks for pointing this out. Made the changes.
In head(c), I wanted to show that, using the mutate command, the count value of each year gets
automatically aligned to its particular year value. Hence, I sorted it. For example, the year 1985
would get 25 as count value at all the places in the count column. Anyway, I've put a better picture of
year count now.
Hope this helps.

Balaji says:
MARCH 1, 2016 AT 11:42 AM

Hi Manish,
I am unable to download the PDF as I get a blank page. Kindly check.

balajimadhav says:
MARCH 1, 2016 AT 11:49 AM

Thanks. It works now after I logged in again.

Ambuj Sharma says:
MARCH 1, 2016 AT 12:29 PM

Hi,
When I use full_join for Outlet Years, my row count increases to 23590924. I did not understand why
full join is used and why the row count is increasing.

Analytics Vidhya Content Team says:
MARCH 1, 2016 AT 1:09 PM

Hi Ambuj
The full_join function returns all rows and all columns from the chosen data sets. And, if a value is not
present, it returns NA. In your case, you might not have specified the by parameter in
full_join.

ginisk sam says:
MARCH 4, 2016 AT 1:42 AM

What I did after c, which has 14204 rows, was as follows:
d <- c %>%
group_by(Outlet_Establishment_Year) %>%
distinct()
then combi <- merge(d, combi, by = "Outlet_Establishment_Year")
combi will now be ready for label encoding
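
One plausible reading of the row-count jump reported above is that the join key is duplicated in both tables, so every matching row on one side is paired with every matching row on the other. The sketch below illustrates that effect and the de-duplication fix on made-up data. It uses base R's `merge()` and `unique()` as stand-ins for the dplyr calls used in the thread, and the column names `id`, `year` and `n` are hypothetical.

```r
# Toy stand-ins (made-up values): 'counts' repeats each year once per row,
# like a per-row count column before de-duplication.
combi  <- data.frame(id = 1:5,
                     year = c(1985, 1985, 1985, 1999, 1999))
counts <- data.frame(year = c(1985, 1985, 1985, 1999, 1999),
                     n    = c(3, 3, 3, 2, 2))

# Duplicate keys on both sides: every match is paired with every match.
blown_up <- merge(counts, combi, by = "year")
print(nrow(blown_up))   # 3*3 + 2*2 = 13 rows instead of 5

# One row per year in the lookup table keeps the original row count.
lookup <- unique(counts)
joined <- merge(lookup, combi, by = "year")
print(nrow(joined))     # 5
```

Comparing `nrow()` before and after a join is a cheap sanity check that would catch this kind of blow-up early.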

ginisk sam says:
MARCH 4, 2016 AT 6:25 AM

Dear Ambuj,
After generating c, I created d using distinct:
d <- c %>%
group_by(Outlet_Establishment_Year) %>%
distinct()
Then merge d with combi as follows:
combi <- merge(d, combi, by = "Outlet_Establishment_Year")
Then it is ready for encoding.

Thanks

Ambuj Sharma says:
MARCH 8, 2016 AT 5:51 AM

Thanks!

gaurav says:
MARCH 2, 2016 AT 5:01 AM

Hi,
Can you please send me the PDF file at [email protected], as I am unable to download the file from the link provided?
Thanks in advance

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106476#RESPOND)
MARCH 2, 2016 AT 9:08 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106476)

Hi Gaurav,
As mentioned, you need to create a one-time user account to download the PDF. You can find the
link in the End Notes.

Jhanak
Sharma1
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106708#RESPOND)
MARCH 6, 2016 AT 11:47 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106708)

Problem No.1 :
When I execute head(b) I get:
Item_Identifier Item_Count
(fctr) (int)
1 DRA12 9
2 DRA24 10

And not
OUT027 2215.876
OUT035 1463.705
I tried the command below, but again got an error:
> combi <- merge(b, combi, by = "Outlet_Identifier")
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column


Problem No.2 :
When I execute table(q)
I get:
Drinks Food Non-Consumable
2180488 16949063 4461373
and not
Drinks Food Non-Consumable
1317 10201 2686
Problem No.3 :
combi <- dummy.data.frame(combi, names =
+ c('Outlet_Size','Outlet_Location_Type','Outlet_Type', 'Item_Type_New'), sep='_')
Error: cannot allocate vector of size 256.0 Mb
In addition: Warning messages:
1: In anyDuplicated.default(row.names) :
Reached total allocation of 3947Mb: see help(memory.size)
2: In anyDuplicated.default(row.names) :
Reached total allocation of 3947Mb: see help(memory.size)
Q. How do I deal with the error "cannot allocate vector of size"?
Please help me with solutions to the problems stated above.

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106848#RESPOND)
MARCH 8, 2016 AT 7:04 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106848)

Hi Jhanak
Thank you so much for pointing this out.
Answer 1: The code is correct. The output I showed needed an update. It's done now; you can check.
Answer 2: I'll need your code to answer this, because I've checked again on my side and the output of
table(q) is
Drinks Food Non-Consumable
1317 10201 2686
Answer 3: It looks like your Problem 2 and Problem 3 are related. After you combine the data sets,
check the dimensions of the combi data set. It should have 14204 rows and 12 columns. It looks like your
combi data set has too many observations. Memory management issues are usually solved in
2 ways: first, by upgrading the machine's specifications; second, by using a sparse matrix for
computation. Also, while computing in R, it is advisable to close other programs
which are not necessary, especially Chrome tabs. This frees up memory and lets R compute faster.
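
A minimal sketch of the two checks mentioned in Answer 3: inspecting the dimensions before encoding, and encoding the categorical columns into a sparse matrix instead of a dense one. It assumes the Matrix package (normally bundled with R) and uses made-up data; the column names only imitate those in the tutorial.

```r
library(Matrix)  # recommended package, normally installed alongside R

# Made-up categorical data standing in for columns such as Outlet_Size.
set.seed(1)
df <- data.frame(
  Outlet_Size = factor(sample(c("Small", "Medium", "High"), 1000, replace = TRUE)),
  Outlet_Type = factor(sample(c("Grocery", "Type1", "Type2", "Type3"), 1000, replace = TRUE))
)

# Check the dimensions first; an unexpected row count usually points to a
# problem in an earlier join.
dim(df)

# Dense encoding: every 0 is stored explicitly.
dense <- model.matrix(~ . - 1, data = df)

# Sparse encoding: only the non-zero entries are stored.
sparse <- sparse.model.matrix(~ . - 1, data = df)

dim(sparse)
print(object.size(dense), units = "Kb")
print(object.size(sparse), units = "Kb")
```

The saving grows with the number of levels, since each extra dummy column is almost entirely zeros.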

midhun1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107334#RESPOND)
MARCH 14, 2016 AT 9:13 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/#COMMENT-107334)

Hi Janak, the dataset is not available now. It seems you have worked on the dataset. Can you
please share the dataset to [email protected] (mailto:[email protected]) It would be
of great help. Thanks.

Venu
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106698#RESPOND)
MARCH 6, 2016 AT 10:43 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106698)

Could you please share the data (./Data/BigMartSales) that you have used here so that we can
play with it?

Venu
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106699#RESPOND)
MARCH 6, 2016 AT 10:48 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106699)

It seems that your PDF file is missing from the link provided. May I request you to update it? Thanks in
advance.

Fred
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106702#RESPOND)
MARCH 6, 2016 AT 11:00 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106702)

I got the PDF file. Thanks

buvana
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106791#RESPOND)


MARCH 7, 2016 AT 9:27 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106791)

Nice tutorial. I have a few questions so far:

a) How do I save my work? For example, all the data manipulation steps I did are lost the next day, and I
have to start from the setwd(path) command again.
b) What is the difference between merge and full_join in the tutorial? When is each command more
appropriate?
c) The group by Item_Identifier is not working properly. The sample output is wrong.

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106849#RESPOND)
MARCH 8, 2016 AT 7:16 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106849)

Hi Buvana
Answer a) Do you write code directly in the console? Use RStudio. You should use an R script, since scripts
can be saved in .R format and let you retrieve your code later. For more information,
check the first section of this tutorial.
Answer b) full_join combines two data frames and keeps every row from both. It returns NA where no matching
value is found. merge also combines two data frames based on a key column, but by default it keeps only
the rows that match in both (pass all = TRUE to keep everything).
In full_join, you don't need to specify the by parameter: if it is omitted, the join uses all the columns the two data frames have in common.
Answer c) Thanks for pointing this out. Sorted now.
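
To make the difference in Answer b concrete, here is a small sketch on two made-up data frames, `x` and `y`. It uses base R only; `merge(..., all = TRUE)` returns the same set of rows as `dplyr::full_join(x, y, by = "id")`, although the row order may differ.

```r
x <- data.frame(id = c(1, 2, 3), a = c("a1", "a2", "a3"))
y <- data.frame(id = c(2, 3, 4), b = c("b2", "b3", "b4"))

# merge() defaults to an inner join: only ids present in both (2 and 3) survive.
inner <- merge(x, y, by = "id")

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA,
# which is the behaviour of a full join.
full <- merge(x, y, by = "id", all = TRUE)

inner
full
```

A default merge suits cases where only matched records matter; a full join suits cases where no row from either table may be dropped.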

Guilherme
Cadori says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106897#RESPOND)
MARCH 9, 2016 AT 4:00 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106897)

Hi,

In the Random Forest section, could you please explain why you used ntree = 1000 after finding
mtry = 15?
Cheers,

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106982#RESPOND)
MARCH 10, 2016 AT 7:07 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106982)


Hi Guilherme
If you carefully check the random forest section, I initially did cross validation using the caret
package. Cross validation provided the optimal value of mtry and ntree at which the RMSE is lowest
(check the output). I then used those parameters in the final random forest model. Another method to
choose mtry and ntree is trial and error, which is certainly time consuming and inconsistent. You may
try this experiment at your end, and let me know if you obtain a lower RMSE than what I've got.

Guilherme
Cadori says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107004#RESPOND)
MARCH 10, 2016 AT 12:46 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107004)

Hi Manish,
Thank you for your attention. I understood how you got mtry. However, in the output printed in this
tutorial, there's no value regarding ntree (e.g. ntree = 1000, which was the value you used later on).
How did you get it?
Thanks,
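
On the ntree question: as far as I know, caret's "rf" and "parRF" methods expose mtry as their only tuning parameter, so cross validation reports a best mtry, while ntree is simply forwarded to randomForest() with whatever value is supplied. The sketch below illustrates this on the built-in mtcars data. The grid and the ntree value are arbitrary choices rather than the tutorial's settings, and the block is skipped if caret or randomForest is not installed.

```r
fit <- NULL
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(caret)
  set.seed(42)
  ctrl <- trainControl(method = "cv", number = 5)

  # Cross validation searches over mtry only.
  grid <- expand.grid(mtry = c(2, 4, 6))

  # ntree is not tuned: it is passed through unchanged to randomForest().
  fit <- train(mpg ~ ., data = mtcars, method = "rf",
               trControl = ctrl, tuneGrid = grid, ntree = 200)

  print(fit$bestTune)          # the mtry with the lowest cross-validated RMSE
  print(fit$finalModel$ntree)  # the ntree value supplied above
}
```

To compare several ntree values, one option is to repeat the `train()` call in a loop over candidate values and compare the resampled RMSE of each fit.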

Arfath
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106925#RESPOND)
MARCH 9, 2016 AT 12:58 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106925)

Thank you very much for this wonderful and unique post. I came to this site to participate
in your data competition. I was puzzled looking at the data sets like train, test and sample, and I didn't
have any idea what to do or how to solve this. Later on I came across this post (thank God I did), and
after going through it I really gained confidence and got a clear picture of how to handle
these competitions. Once again, thanks from the bottom of my heart. Since I am completely new to this, I
have a few doubts:
1) In linear_model <- lm(Item_Outlet_Sales ~ ., data = new_train), what does tilde (~) followed by
dot (.) mean?
2) What is the best RMSE score for any model?
3) So the train and test data sets are the same; the only difference is that the test data doesn't have the response variable.
But if we already know the response variable's value from the train data set, why are we
calculating it again for the test data set? Is it because we want to construct a model which predicts
future outcomes, but we want to test how well our model predicts, so that's why we took a
sample from the main data set and cross-check our predicted values against those of the main data set?
Correct me if my understanding is wrong.

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106986#RESPOND)
MARCH 10, 2016 AT 8:09 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/#COMMENT-106986)

Hi Arfath
Good to know that you have started learning.
Answer 1: In a model formula, the tilde (~) separates the response from the predictors, and the dot (.)
stands for all the other variables in the data, so the model uses them all at once. Otherwise, it
would be very inconvenient to write the names of all the variables one by one. Imagine the time that
would be wasted if you had 200 variables to write. Therefore, use the shorthand tilde (~)
followed by dot (.).
Answer 2: Ideally, every model strives to achieve an RMSE as close to zero as possible, because zero
means your model has predicted the outcome exactly. But that's not possible, since every
model has irreducible error which affects its accuracy. Hence, the best RMSE score is the
lowest score you can get.
Answer 3: You are absolutely right. The train data set has the response variable, and a model is trained on it.
This model may give you a fantastic RMSE score, but it is worthless unless it predicts with the same
accuracy on out-of-sample data. The ultimate aim of the model is to make future predictions,
right? Hence, test data is used to check the out-of-sample accuracy of the model. If the accuracy is
not as good as what you achieved on the train data set, it suggests that overfitting has taken place.
I would recommend that you read Introduction to Statistical Learning. A download link is available in
my previous article: http://www.analyticsvidhya.com/blog/2016/02/free-read-books-statistics-mathematics-data-science/
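
The three answers above can be tried out in a few lines. This is only a sketch on R's built-in mtcars data, not the Big Mart data: since a competition test set has no response column, part of the training data is held out here to play the role of unseen data. The 70/30 split and the helper `rmse()` are illustrative choices.

```r
set.seed(123)
# Hold out roughly 30% of the rows to play the role of unseen data.
idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
holdout <- mtcars[-idx, ]

# "mpg ~ ." means: model mpg using every other column in the data.
fit_all <- lm(mpg ~ ., data = train)
# Listing predictors by name instead restricts the model to those columns.
fit_two <- lm(mpg ~ wt + hp, data = train)

# Root mean squared error: lower is better, zero would be a perfect fit.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$mpg, predict(fit_all, newdata = train))
rmse_holdout <- rmse(holdout$mpg, predict(fit_all, newdata = holdout))
c(train = rmse_train, holdout = rmse_holdout)
```

A holdout RMSE much larger than the training RMSE is the sign of overfitting described in Answer 3.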

vijaypk10
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106936#RESPOND)
MARCH 9, 2016 AT 5:43 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106936)

I am a little late to the game. How do i download the BigMartSales data?


Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106981#RESPOND)
MARCH 10, 2016 AT 6:58 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106981)

Hi Vijay
Link is available in the tutorial.

Idea4Life
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107008#RESPOND)
MARCH 10, 2016 AT 1:59 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107008)

Sorry Manish. I believe the link you are mentioning is Big Mart Sales Prediction. But when I go
into it, it says "The dataset is accessible only if the contest is active." Can you please check and
clarify?
Thanks,
Vijay

VK says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107105#RESPOND)
MARCH 11, 2016 AT 7:04 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107105)

Sorry Manish. Tried from the link Big Mart Sales Prediction in the document. But when i go to the
link Data Set, it shows up the following message:
The dataset is accessible only if the contest is active.
Can you please validate again?

Thanks.

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107145#RESPOND)
MARCH 12, 2016 AT 5:10 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107145)

Hi Vijay


The contest will get active again from tomorrow (13th March 2016).
Regret the inconvenience caused.

Alfa
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106993#RESPOND)
MARCH 10, 2016 AT 9:52 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/#COMMENT-106993)
Thanks for sharing.

I just cannot understand what One Hot Encoding means and how to use it, because I am new
here.
Thanks!

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107148#RESPOND)
MARCH 12, 2016 AT 5:18 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107148)

Hi Alfa
One Hot Encoding is nothing but splitting the levels of a categorical variable into new variables.
The new variables are encoded with 0s and 1s: 1 represents the presence of a level, and 0
represents its absence.
For example, suppose we have a variable named Hair Color. It has 3 levels, namely Red Hair,
Black Hair and Brown Hair. One hot encoding this variable results in 3 different variables,
namely Red Hair, Black Hair and Brown Hair, and the original variable Hair Color is removed from
the data set.
If someone has Red Hair, the Red Hair variable will be 1, Black Hair will be 0, Brown Hair will be 0.
If someone has Black Hair, the Red Hair variable will be 0, Black Hair will be 1, Brown Hair will be 0.
If someone has Brown Hair, the Red Hair variable will be 0, Black Hair will be 0, Brown Hair will be 1.

This is One Hot Encoding.
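
The Hair Color example can be reproduced in a few lines. This sketch uses base R's `model.matrix()` rather than the `dummy.data.frame()` call that appears in other comments here, only because it needs no extra package; the data frame `people` is made up.

```r
people <- data.frame(
  Hair_Color = factor(c("Red Hair", "Black Hair", "Brown Hair", "Red Hair"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column.
encoded <- model.matrix(~ Hair_Color - 1, data = people)

# Columns come out in the order of the factor levels, so label them that way.
colnames(encoded) <- levels(people$Hair_Color)
encoded
```

Each row has exactly one 1, in the column of that person's hair color.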

Prateek
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107263#RESPOND)
MARCH 13, 2016 AT 2:10 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107263)

Hi Manish,


Is it advisable to use One hot encoding when there is huge number of levels in a categorical
variable ?

midhun1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107002#RESPOND)
MARCH 10, 2016 AT 11:48 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107002)
Can someone please mail me the data sets we need for this article at [email protected]
(mailto:[email protected])? I couldn't find them at the mentioned location. It would be really
helpful. Thanks

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107147#RESPOND)
MARCH 12, 2016 AT 5:12 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107147)

Hi Midhun
The data set will be available for download from tomorrow onwards (13th March 2016)
Regret the inconvenience caused.

midhun1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107323#RESPOND)
MARCH 14, 2016 AT 7:27 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107323)

Hi Manish, sorry to bother you, but it seems the data set is still unavailable. If it's not too much
trouble, can you please mail the data to [email protected] (mailto:[email protected])?

manojlakki7
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107069#RESPOND)
MARCH 11, 2016 AT 4:18 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107069)

Hi Manish,
It's a great article and gives a good start for a beginner like me. Can you please share the data? I can't
download it from the link as the contest is not active.
Thank You


Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107146#RESPOND)
MARCH 12, 2016 AT 5:11 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107146)

Hi Manoj
The data set will be available for download from tomorrow onwards. (13th March 2016)

Roy Basan (http://none) says:

REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107491#RESPOND)

MARCH 16, 2016 AT 12:08 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107491)

Good day.
When I try to install library(swirl) in the RStudio console, it states that it is not found for
R version 3.2.4. I got an error which states:
Warning in install.packages :
package 'library(swirl)' is not available (for R version 3.2.4)
Can somebody explain this peculiarity to me, and how I can sort it out?
Thanks

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107544#RESPOND)

MARCH 17, 2016 AT 4:12 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/#COMMENT-107544)

Hi Roy
First you should install the swirl package and then load it using the library function. Note that
install.packages takes the package name in quotes. Use the commands below.
> install.packages("swirl")
> library(swirl)

midhun1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107564#RESPOND)
MARCH 17, 2016 AT 5:36 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107564)

Hi Manish, The datasets are available now. Thank you so much.

victoronclinx
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107616#RESPOND)
MARCH 17, 2016 AT 9:22 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107616)


I am having problems logging in at http://datahack.analyticsvidhya.com/signup
Can you help me?
I want to log in and then download the data set.
Thanks in advance

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107636#RESPOND)
MARCH 18, 2016 AT 5:37 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/#COMMENT-107636)

Hello
There were some technical updates going on at the server. Things are fine now. You may try again.
Regret the inconvenience caused.

Sourabh1987
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107799#RESPOND)
MARCH 19, 2016 AT 7:56 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107799)

I am trying feature engineering on Outlet_Establishment_Year, but the code for merging creates a
lot of rows. I tried both merge and full_join.

JAYMIN
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107918#RESPOND)
MARCH 21, 2016 AT 11:48 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107918)

Hello sir, I am a fresher electrical engineer, and my maths and logical thinking are good. Can I become
a data scientist? Please give me some advice. Thanks

Roy Basan
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=108023#RESPOND)
MARCH 22, 2016 AT 7:33 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-108023)

I tried to open the link for the Big Mart Sales Prediction, but I was unable to, as it requires
membership. When I applied for Analytics Vidhya membership by signing up, I got an
"Invalid Request" error twice. May I know how I can get over this issue? Why can't I sign up, so that I can
continue with my R self-tutorial work?

Hulisani
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=109002#RESPOND)
APRIL 5, 2016 AT 4:02 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-109002)

Hi
Thanks for an amazing article. Can you please email me the data used?

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112100#RESPOND)
JUNE 10, 2016 AT 11:09 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-112100)

Hi Hulisani
Please download the data set from here: http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii

Priyanka
Nath says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=109960#RESPOND)
APRIL 24, 2016 AT 7:20 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-109960)

Hi,
I am facing a problem with Random Forest execution.
I am using RStudio (R version 3.2.4 Revised).
When I try to run the rf_model code followed by print(rf_model), it returns an error of this form:

Error in { : task 1 failed cannot allocate vector of size 554.2 Mb
In addition: Warning messages:
1: executing %dopar% sequentially: no parallel backend registered
2: In eval(expr, envir, enclos) :
model fit failed for Fold1: mtry=15 Error in { : task 1 failed cannot allocate vector of size 354.7 Mb
3: In eval(expr, envir, enclos) :
model fit failed for Fold2: mtry= 2 Error in { : task 1 failed cannot allocate vector of size 177.3 Mb
4: In eval(expr, envir, enclos) :
model fit failed for Fold2: mtry=28 Error in { : task 1 failed cannot allocate vector of size 177.3 Mb
5: In eval(expr, envir, enclos) :
model fit failed for Fold3: mtry=15 Error in { : task 1 failed cannot allocate vector of size 177.4 Mb
6: In eval(expr, envir, enclos) :
model fit failed for Fold4: mtry= 2 Error in { : task 1 failed cannot allocate vector of size 354.8 Mb
7: In eval(expr, envir, enclos) :
model fit failed for Fold4: mtry=28 Error in { : task 1 failed cannot allocate vector of size 354.8 Mb
8: In eval(expr, envir, enclos) :
model fit failed for Fold5: mtry=15 Error in { : task 1 failed cannot allocate vector of size 177.4 Mb
9: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
10: display list redraw incomplete
Timing stopped at: 1.26 0.3 2.49
Can you please suggest a way out of this issue?

Priyanka
Nath says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=109961#RESPOND)
APRIL 24, 2016 AT 7:23 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-109961)

The code I am trying to run is:

rf_model <- train(Item_Outlet_Sales ~ ., data = new_train, method = "parRF", trControl = control,
prox = TRUE, allowParallel = TRUE)
print(rf_model)

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112101#RESPOND)
JUNE 10, 2016 AT 11:11 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-112101)


Hi Priyanka
Had I been in your place, I wouldn't have experimented with parallel random forest on this
problem.
Why make things complicated when it can be done in a simple way!
Also, make sure that you drop the ID column before running any algorithm. Things should work
fine then.
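
One further observation on the error log, offered as a plausible reading rather than a confirmed diagnosis: some of the allocation sizes reported are what an n-by-n matrix of 8-byte doubles would need, and `prox = TRUE` asks randomForest to build exactly such a proximity matrix. The sketch below does the arithmetic, assuming a training set of 8523 rows (a figure not stated in this thread) and the 5-fold cross validation implied by the fold names in the log.

```r
# Memory needed for an n x n matrix of 8-byte doubles, in Mb (1 Mb = 1024^2 bytes).
square_matrix_mb <- function(n) n^2 * 8 / 1024^2

n_train <- 8523                    # assumed number of training rows
n_fold  <- floor(n_train * 4 / 5)  # rows used for fitting within one of 5 CV folds

sprintf("%.1f Mb", square_matrix_mb(n_train))  # "554.2 Mb"
sprintf("%.1f Mb", square_matrix_mb(n_fold))   # "354.7 Mb"
```

If that reading is right, dropping `prox = TRUE` from the `train()` call may be enough to avoid the allocation failures, independently of whether the parallel method is used.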

Raju
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=110044#RESPOND)
APRIL 26, 2016 AT 9:00 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-110044)

Hi Manish,
After reading the whole article, I feel you have done a great job and have given more than enough
material for a beginner.
I'm thankful to you for sharing all your solutions; this gives us a different way of thinking to start
with.
Regards,
Raju.

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112099#RESPOND)
JUNE 10, 2016 AT 11:08 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-112099)

Glad it helped you. Thanks for your kind words Raju!

Gregory
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=111547#RESPOND)
MAY 28, 2016 AT 11:50 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-111547)

Good morning
I cannot find the data set. Any suggestions?

Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112098#RESPOND)


JUNE 10, 2016 AT 11:07 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-112098)

Hi Gregory
Please download the data from here: http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii

Gregory
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=111548#RESPOND)
MAY 28, 2016 AT 11:52 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-111548)

OK. I've registered and I think it'll be OK.
Thanks

Toddhim
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112093#RESPOND)
JUNE 10, 2016 AT 9:45 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/#COMMENT-112093)

I know this is months after this great article was published, but I'm just now working through it,
and the BigMart Sales Prediction data set isn't available. Is it available elsewhere?

Analytics Vidhya Content Team says:
JUNE 10, 2016 AT 11:06 PM

Hi Toddhim,
The data set is available. I've already updated the links.
You can download the data from here: http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii

vipin says:
JULY 12, 2016 AT 5:22 PM

Hi Manish,
First of all, thanks for a great article.
I ran into an issue when running this code:
combi <- full_join(c, combi, by = "Outlet_Establishment_Year")
It gives me this error:
Error: std::bad_alloc
What is it, and how do I correct it?
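
`std::bad_alloc` means R ran out of memory while building the result. One plausible cause, offered as an assumption because the definition of `c` is not shown in this thread, is that `c` has one row per row of `combi`. In that case the join key repeats on both sides and every matching pair of rows is emitted. The sketch below uses a made-up toy data frame, a hypothetical `lookup` table, base R `merge()` standing in for `full_join()`, and an assumed reference year of 2013 to show the blow-up and a join-free alternative.

```r
# Toy stand-in for the tutorial's combi data (values are made up)
combi <- data.frame(
  Outlet_Establishment_Year = c(1985, 1985, 1999, 1999, 1999),
  Item_MRP = c(10, 20, 30, 40, 50)
)

# Hypothetical lookup with one row per row of combi, so each year repeats
lookup <- data.frame(
  Outlet_Establishment_Year = combi$Outlet_Establishment_Year,
  Outlet_Year = 2013 - combi$Outlet_Establishment_Year  # 2013 is assumed
)

# A key duplicated on both sides multiplies rows: 2 * 2 + 3 * 3 = 13
blown_up <- merge(lookup, combi, by = "Outlet_Establishment_Year")
nrow(blown_up)

# Join-free alternative: compute the new column in place
combi$Outlet_Year <- 2013 - combi$Outlet_Establishment_Year
nrow(combi)
```

On a data set with thousands of rows per year, the same multiplication can produce millions of rows, which is where the allocation is likely to fail.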

vipin says:
JULY 12, 2016 AT 8:36 PM

2.
combi <- dummy.data.frame(combi, names = c('Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_New'), sep = '_')
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
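
This error tends to appear when a function expects a column to come back as a plain vector but receives a list-like object instead. One possible cause, stated as an assumption rather than a diagnosis, is that earlier dplyr steps turned `combi` into a tibble, for which `combi[, "Outlet_Size"]` stays a one-column table. Converting back with base R's `as.data.frame()` before the call is a cheap thing to try. The toy data below is made up.

```r
# Toy stand-in for combi (made-up values)
combi <- data.frame(
  Outlet_Size = c("Small", "Medium", "High"),
  Item_MRP = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Force a plain data.frame; harmless if combi already is one
combi <- as.data.frame(combi)

# For a plain data.frame, single-column indexing drops to an atomic vector
size_col <- combi[, "Outlet_Size"]
is.atomic(size_col)

# A one-column data frame, by contrast, is still a list
is.list(combi["Outlet_Size"])
```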

simar says:
JULY 31, 2016 AT 3:27 AM

Hi Manish,
Can you please let me know what you mean by "Item_Fat_Content has mismatched factor levels"?
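
"Mismatched factor levels" generally means that one category is spelled several ways, so R treats each spelling as a separate level. A minimal sketch, assuming the column contains variants such as `LF`, `low fat` and `reg` (these example values are an assumption, not taken from this thread):

```r
# Made-up sample of Item_Fat_Content with inconsistent spellings
fat <- factor(c("LF", "Low Fat", "low fat", "reg", "Regular", "Low Fat"))
levels(fat)            # five levels for what are really two categories

# Map each variant spelling onto one canonical label
fat_chr <- as.character(fat)
fat_chr[fat_chr %in% c("LF", "low fat")] <- "Low Fat"
fat_chr[fat_chr == "reg"] <- "Regular"
fat_clean <- factor(fat_chr)

levels(fat_clean)
table(fat_clean)
```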

Parul says:
AUGUST 3, 2016 AT 5:50 PM

Hi Manish. Thanks for this article. It is very well written and will help everyone. I have one query: I could follow
your post very well up to "Graphical Representation of Variables", after which I am unable to figure
out how to write this code, what it means, and how to know which command to
use and when. I am a beginner in R. Can you please suggest what I should do to fully
understand all the steps from Graphical Representation onward? This includes Data Manipulation and
Predictive Modeling as well. Thanks a lot.


KarlWang says:
AUGUST 4, 2016 AT 8:37 PM

Great article, and thank you so much for sharing your knowledge! I am not sure whether others have
the same questions, but I have listed mine below. I hope you have some time to take a look.
Thank you again.
1. About the difference between label encoding and one hot encoding: for label encoding, your
example converts the two-level variable Item_Fat_Content into 0 and 1. If I have a US state
variable (50 levels = 50 states), does that mean I simply map the states to the numbers 1-50? It is
still one variable, just converted from categorical to numerical, am I right?
2. For one hot encoding, I need to split it into 50 variables (one per state) and mark them with 0s and 1s to
indicate presence or absence, am I right?
3. What are the advantages and disadvantages of converting categorical variables into numeric
variables? Why do we need this transformation?
4. The article says, "We did one hot encoding and label encoding. That's not necessary since
linear regression handle categorical variables by creating dummy variables intrinsically." How do
we know which models need one hot encoding or label encoding?
5. You mentioned correlated variables. At what level of correlation do we need to remove them?
0.5, 0.6 or 0.7? And if two variables are correlated, how do we decide which one to
remove? Is there any standard for this?
6. I am running a logistic regression. When I remove one of the correlated variables (0.68), the R²
drops. Does that mean this level of correlation (0.68) is acceptable?
7. For the linear regression model, a funnel shape means heteroscedasticity. How do we evaluate a
logistic regression with a Residuals vs Fitted graph?
8. The article says, "This model can be further improved by detecting outliers and high
leverage points." What is the technique for dealing with these points? Do we simply remove the record,
replace the value with the average, or do something else?
9. "optimum cp value for our model with 5 fold cross validation": in my mind, cross validation is
used to evaluate model stability, which is the last step. Here, however, we use cross
validation to find the optimum cp value. Do I understand this correctly?
10. Why are you using 5 fold cross validation instead of 4 fold, 6 fold or 10 fold?
11. When I run the model, I always get an error telling me the tree cannot split. Are there any
requirements for the decision tree? For example, can we not use categorical variables in a decision tree?
12. How do I tune the parameters for random forest? Could you point me to any materials?
Thank you !!!
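
Questions 1 and 2 can be made concrete with a small sketch. It uses a made-up `state` column with three states instead of fifty, and base R's `model.matrix()` for the one-hot step. This is an illustration under those assumptions, not the encoding code from the article.

```r
# Made-up categorical column
df <- data.frame(state = factor(c("CA", "NY", "TX", "NY", "CA")))

# Label encoding: still one column, categories replaced by integer codes
df$state_label <- as.integer(df$state)

# One-hot encoding: one 0/1 column per level ("- 1" drops the intercept
# so that every level gets its own column)
one_hot <- model.matrix(~ state - 1, data = df)

df$state_label
colnames(one_hot)
rowSums(one_hot)       # each row has exactly one 1
```

Label codes impose an order (here TX > NY > CA) that a model may read as numeric distance, which is one reason one-hot columns are often preferred for unordered categories.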


Monish Mathpal says:
AUGUST 12, 2016 AT 6:53 PM

In the excerpt from the article below:

"Data Frame: This is the most commonly used member of data types family. It is used to store
tabular data. It is different from matrix. In a matrix, every element must have same class. But, in a
data frame, you can put list of vectors containing different classes. This means, every column of a
data frame acts like a list. Every time you will read data in R,"

it seems a bit unconvincing that a column of a data frame acts like a list. As per my understanding,
it should be a row rather than a column.
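
The excerpt is perhaps better read as: a data frame is a list of columns, and each column is a vector with a single class. A short check in base R, using toy data made up for illustration:

```r
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

is.list(df)            # a data frame is a list ...
length(df)             # ... whose elements are the columns, so length is 2
class(df[["id"]])      # each column keeps a single class
class(df[["name"]])

# A row mixes classes, so extracting one gives a one-row data frame,
# not a vector
class(df[1, ])
```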

Vaibhav Gupta says:
AUGUST 20, 2016 AT 6:59 AM

Hi
I am a beginner in data science using R. I was going through your well-articulated article on data
science using R. I was practicing your Big Mart prediction and got confused by one step,
where it checks for missing values during exploration of the train data. According to R and this tutorial, there are
missing values only in Item_Weight (I assume a blank is considered missing data), but data is
also missing in Outlet_Size in the train CSV. Neither R nor this tutorial shows Outlet_Size as
having missing values.
Can you please let me know how and why Outlet_Size is not considered to have missing values in
the data exploration of train?
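
A likely explanation is that `read.csv()` turns a blank field in a numeric column into `NA`, but keeps a blank field in a text column as the empty string `""`, which `is.na()` does not count. A minimal sketch with made-up CSV text (the real file is not reproduced here):

```r
# Made-up CSV text with one blank Item_Weight and one blank Outlet_Size
csv_text <- "Item_Weight,Outlet_Size
9.3,Medium
,Small
5.9,
"

# Default read: the blank number becomes NA, the blank text stays ""
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
colSums(is.na(raw))

# Treat empty strings as missing as well
fixed <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("", "NA"))
colSums(is.na(fixed))
```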

Vaibhav Gupta says:
AUGUST 20, 2016 AT 8:43 AM

Hi
I would also like to know which mathematical concepts, such as algebra and statistics, are required to
learn data science using R. Can anybody list all the mathematical concepts required for data
science?


Thanks
Vaibhav Gupta

Iñigo says:
AUGUST 24, 2016 AT 1:28 PM
Hello, I had an error when launching RStudio. I downloaded and installed it again, and
the second time I noticed this note:

"RStudio requires R 2.11.1 (or higher). If you don't already have R, you can download it here." ("here" is
a link)
So it looks like base R has to be installed first, before installing RStudio. I am writing this in case someone
has the same problem.
Good job with the website, I really like it.

Shuu says:
AUGUST 25, 2016 AT 1:13 PM

Speaking as someone who came from a non-coding background: you should know that small details can
become HUGE hindrances in the learning process of a beginner.
In the "Essentials" part of the article, this code doesn't work:
> bar class(bar)
> integer
> as.numeric(bar)
> class(bar)
> numeric
> as.character(bar)
> class(bar)
> character

You actually have to assign the result, as in 'bar <- as.numeric(bar)', on the 4th line.


Please keep those small things in mind. It is insanely difficult for someone like me to learn this
content; if things are any less than perfect, it really becomes impossible (I just spent almost an
hour trying to figure out why I couldn't change the class of the object, and in the end had to ask for
external help since I couldn't troubleshoot it myself).
Otherwise, great article, keep the great work up!
Cheers.
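
The fix described in the comment above can be sketched as follows. The starting value `0:5` is an assumption made for illustration. The key point is that `as.numeric()` and `as.character()` return a converted copy and leave the original object untouched unless the result is assigned back.

```r
bar <- 0:5             # assumed starting value
class(bar)             # "integer"

as.numeric(bar)        # prints a converted copy; bar itself is unchanged
class(bar)             # still "integer"

bar <- as.numeric(bar) # assign the result back to change bar
class(bar)             # "numeric"

bar <- as.character(bar)
class(bar)             # "character"
```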

Joyce Salil says:
OCTOBER 24, 2016 AT 8:52 AM

Thanks, you made R programming simpler.
Could you please email me a PDF of this article?





Copyright 2016 Analytics Vidhya
