A Complete Tutorial To Learn Data Science in R From Scratch
Introduction
https://www.analyticsvidhya.com/blog/2016/02/completetutoriallearndatasciencescratch/#two
1/84
20/11/2016
R is a powerful language used widely for data analysis and statistical computing. It was developed in
the early 90s. Since then, endless efforts have been made to improve R's user interface. The journey of the R
language from a rudimentary text editor to interactive R Studio and, more recently, Jupyter Notebooks
(http://discuss.analyticsvidhya.com/t/how-to-run-r-on-jupyter-ipython-notebooks/5512)
has
Table of Contents
1. Basics of R Programming for Data Science
Why learn R ?
How to install R / R Studio ?
How to install R packages ?
Basic computations in R
2. Essentials of R Programming
Data Types and Objects in R
Control Structures (Functions) in R
Useful R Packages
3. Exploratory Data Analysis in R
Basic Graphs
Treating Missing values
Working with Continuous and Categorical Variables
4. Data Manipulation in R
Feature Engineering
Label Encoding / One Hot Encoding
5. Predictive Modeling using Machine Learning in R
Linear Regression
Decision Tree
Random Forest
Note: The data set used in this article is from Big Mart Sales Prediction
(http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii).
1. Basics of R Programming
Why learn R ?
I don't know if I have a solid reason to convince you, but let me share what got me started. I have no
prior coding experience. Actually, I never had computer science among my subjects. I came to know that
to learn data science, one must learn either R or Python as a starter. I chose the former. Here are
some benefits I found after using R:
There are many more benefits. But, these are the ones which have kept me going. If you think they
are exciting, stick around and move to the next section. And, if you aren't convinced, you may like
Complete Python Tutorial from Scratch (https://www.analyticsvidhya.com/blog/2016/01/completetutorial-learn-data-science-python-scratch-2/).
The sheer power of R lies in its incredible packages. In R, most data handling tasks can be performed
in 2 ways: using R packages and R base functions. In this tutorial, I'll also introduce you to the most
handy and powerful R packages. To install a package, simply type:
install.packages("packagename")
As a first time user, a pop-up might appear to select your CRAN mirror (country server), choose
Basic Computations in R
Let's begin with the basics. To get familiar with the R coding environment, start with some basic calculations.
The R console can be used as an interactive calculator too. Type the following in your console:
> 2 + 3
[1] 5
> 6 / 3
[1] 2
> (3 * 8) / (2 * 3)
[1] 4
> log(12) # natural logarithm; use log10(12) for base 10
[1] 2.484907
> sqrt(121)
[1] 11
Similarly, you can experiment with various combinations of calculations and get the results. In case you
want to obtain a previous calculation, click in the R console and press the Up / Down Arrow key on your
keyboard. This will bring back the previously executed commands.
Press Enter.
But, what if you have done too many calculations? It would be too painful to scroll through every
command to find it. In such situations, creating a variable is a helpful way.
In R, you can create a variable using the <- or = sign. Let's say I want to create a variable x to compute the
sum of 7 and 8. I'll write it as:
> x <- 8 + 7
> x
[1] 15
Once we create a variable, we no longer get the output directly (like a calculator), unless we call the
variable in the next line. Remember, variable names can be alphabetic or alphanumeric, but not purely
numeric, and a name cannot start with a digit.
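As a minimal sketch of the assignment rules described above (the names x and total are only for illustration):

```r
# Assign with <- (the conventional operator) or with =
x <- 8 + 7
total = x * 2

# Assignment prints nothing; call the variable to see its value
print(x)
print(total)
```

A name such as 2x would be rejected by R because it starts with a digit, while x2 is fine.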
Since these classes are self-explanatory by name, I won't elaborate on them. These classes have
attributes. Think of attributes as their identifier, a name or number which aptly identifies them. An
object can have the following attributes:
1. names, dimension names
2. dimensions
3. class
4. length
Attributes of an object can be accessed using the attributes() function. More on this is coming in the following
section.
Let's understand the concept of objects and attributes practically. The most basic object in R is known
as a vector. You can create an empty vector using vector(). Remember, a vector contains objects of the same
class.
For example: Let's create vectors of different classes. We can also create a vector using the c() or concatenate
command.
> a <- c(1.8, 4.5) # numeric
> b <- c(1 + 2i, 3 - 6i) # complex
> d <- c(23L, 44L) # integer (the L suffix is required; c(23, 44) would be numeric)
> e <- vector("logical", length = 5)
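To confirm which class each vector actually has, class() can be called on it. The sketch below repeats the four vectors; note the L suffix on the integer vector, without which R stores 23 and 44 as numeric (double).

```r
a <- c(1.8, 4.5)                    # numeric
b <- c(1 + 2i, 3 - 6i)              # complex
d <- c(23L, 44L)                    # integer
e <- vector("logical", length = 5)  # five FALSE values

classes <- c(class(a), class(b), class(d), class(e))
print(classes)
```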
Data Types in R
R has various data types, which include vectors (numeric, integer etc), matrices, data frames
and lists. Let's understand them one by one.
Vector: As mentioned above, a vector contains objects of the same class. But, you can mix objects of
different classes too. When objects of different classes are mixed in a vector, coercion occurs. This effect
causes the objects of different types to convert into one class. For example:
> qt <- c("Time", 24, "October", TRUE, 3.33) # character
> ab <- c(TRUE, 24) # numeric
> cd <- c(2.5, "May") # character
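The coercion can be checked directly. In this small sketch, class() shows the class R settled on, and printing ab shows that TRUE was converted to 1:

```r
qt <- c("Time", 24, "October", TRUE, 3.33)
ab <- c(TRUE, 24)
cd <- c(2.5, "May")

print(class(qt))  # everything became character
print(class(ab))  # the logical was promoted to numeric
print(ab)
print(class(cd))
```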
Similarly, you can change the class of any vector. But, you should pay attention here. If you try to
convert a character vector to numeric, NAs will be introduced for every element that is not a valid number.
Hence, you should be careful when converting classes.
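The conversion command itself is not shown in the text above. A minimal sketch, assuming the base R function as.numeric() is what is meant (the vector bar is made up for illustration):

```r
bar <- c("10", "20", "thirty")

# "thirty" cannot be parsed as a number, so it becomes NA.
# R also raises a warning here; suppressWarnings() only silences it.
num <- suppressWarnings(as.numeric(bar))
print(num)
```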
List: A list is a special type of vector which contains elements of different data types. For example:
> my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 22
[[2]]
[1] "ab"
[[3]]
[1] TRUE
[[4]]
[1] 1+2i
As you can see, the output of a list is different from a vector. This is because all the objects are of
different types. The double bracket [[1]] shows the index of the first element and so on. Hence, you can
easily extract an element of a list depending on its index. Like this:
> my_list[[3]]
[1] TRUE
You can use the single bracket [] too. But, that would return a list containing the element, instead
of the element itself. Like this:
> my_list[3]
[[1]]
[1] TRUE
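The difference between the two bracket styles is easiest to see from the class of what comes back. A short sketch (elem and sub are illustrative names):

```r
my_list <- list(22, "ab", TRUE, 1 + 2i)

elem <- my_list[[3]]  # the element itself
sub  <- my_list[3]    # a list of length 1 that holds the element

print(class(elem))
print(class(sub))
```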
Matrices: When a vector is given rows and columns, i.e. a dimension attribute, it becomes a
matrix. A matrix is represented by a set of rows and columns. It is a 2 dimensional data structure. It
consists of elements of the same class. Let's create a matrix of 3 rows and 2 columns:
> my_matrix <- matrix(1:6, nrow = 3, ncol = 2)
> my_matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> dim(my_matrix)
[1] 3 2
> attributes(my_matrix)
$dim
[1] 3 2
As you can see, the dimensions of a matrix can be obtained using either the dim() or the attributes()
command. To extract a particular element from a matrix, simply use the index shown above. For
example (try this at your end):
> my_matrix[, 2] # extracts second column
> my_matrix[, 1] # extracts first column
> my_matrix[2, ] # extracts second row
> my_matrix[1, ] # extracts first row
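For reference, here is a small sketch of what these index expressions return on the 3 x 2 matrix built from 1:6 (the variable names are illustrative):

```r
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)

second_col <- my_matrix[, 2]   # whole second column
second_row <- my_matrix[2, ]   # whole second row
single     <- my_matrix[3, 2]  # one element: row 3, column 2

print(second_col)
print(second_row)
print(single)
```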
As an interesting fact, you can also create a matrix from a vector. All you need to do is assign
the dimensions with dim() later. Like this:
> age <- c(23, 44, 15, 12, 31, 16)
> age
[1] 23 44 15 12 31 16
> dim(age) <- c(2, 3)
> age
     [,1] [,2] [,3]
[1,]   23   15   31
[2,]   44   12   16
> class(age) # R 4.0.0 and later print "matrix" "array"
[1] "matrix"
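You can also join two vectors using the cbind() and rbind() functions. But, make sure that both vectors
have the same number of elements. If not, R recycles the shorter vector to fill the gap (with a warning
when the lengths are not multiples of each other), which is rarely what you want.
> x <- c(1, 2, 3, 4, 5, 6)
> y <- c(20, 30, 40, 50, 60, 70)
> cbind(x, y)
     x  y
[1,] 1 20
[2,] 2 30
[3,] 3 40
[4,] 4 50
[5,] 5 60
[6,] 6 70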
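A short sketch of both functions, plus what happens with unequal lengths. The vector named short is made up for illustration. Base R recycles the shorter vector rather than padding with NA, so here its three values simply repeat:

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(20, 30, 40, 50, 60, 70)

by_col <- cbind(x, y)  # vectors become columns: 6 rows, 2 columns
by_row <- rbind(x, y)  # vectors become rows: 2 rows, 6 columns

# Unequal lengths: the shorter vector is recycled to length 6
short <- c(20, 30, 40)
recycled <- cbind(x, short)

print(dim(by_col))
print(dim(by_row))
print(recycled)
```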
Data Frame: This is the most commonly used member of the data types family. It is used to store tabular
data. It is different from a matrix. In a matrix, every element must have the same class. But, in a data frame,
you can put a list of vectors containing different classes. This means every column of a data frame acts
like a list. Whenever you read tabular data into R, it is stored in the form of a data frame. Hence, it is
important to understand the commonly used commands on data frames:
> # stringsAsFactors = TRUE reproduces the Factor column shown below; since R 4.0.0 the default is FALSE
> df <- data.frame(name = c("ash", "jane", "paul", "mark"), score = c(67, 56, 87, 91), stringsAsFactors = TRUE)
> df
  name score
1  ash    67
2 jane    56
3 paul    87
4 mark    91
> dim(df)
[1] 4 2
> str(df)
'data.frame': 4 obs. of 2 variables:
 $ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
 $ score: num 67 56 87 91
> nrow(df)
[1] 4
> ncol(df)
[1] 2
Let's understand the code above. df is the name of the data frame. dim() returns the dimensions of the data
frame as 4 rows and 2 columns. str() returns the structure of a data frame, i.e. the list of variables
stored in the data frame. nrow() and ncol() return the number of rows and number of columns in a
are specially treated in data set. For more explanation, click here
(https://www.analyticsvidhya.com/blog/2015/11/8-ways-deal-continuous-variables-predictivemodeling/).
Let's now understand the concept of missing values in R. This is one of the most painful yet crucial
parts of predictive modeling. You must be aware of all techniques to deal with them. The complete
explanation on such techniques is provided here
(https://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-buildingmodel-part-2/).
Missing values in R are represented by NA and NaN. Now we'll check if a data set has missing values
(using the same data frame df).
> df[1:2, 2] <- NA # injecting NA at 1st, 2nd row and 2nd column of df
> df
  name score
1  ash    NA
2 jane    NA
3 paul    87
4 mark    91
> is.na(df) # checks the entire data set for NAs and returns logical output
      name score
[1,] FALSE  TRUE
[2,] FALSE  TRUE
[3,] FALSE FALSE
[4,] FALSE FALSE
> table(is.na(df)) # returns a table of logical output
FALSE  TRUE
    6     2
The use of the na.rm = TRUE parameter tells R to ignore the NAs and compute the mean of the remaining
values in the selected column (score). To remove rows with NA values in a data frame, you can use
na.omit:
> new_df <- na.omit(df)
> new_df
  name score
3 paul    87
4 mark    91
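The effect of na.rm = TRUE can be sketched on the same small data frame. This is an illustrative reconstruction that assumes the base R mean() function is what the text refers to:

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(NA, NA, 87, 91))

plain_mean <- mean(df$score)                # NA, because NAs propagate
mean_score <- mean(df$score, na.rm = TRUE)  # mean of 87 and 91 only

new_df <- na.omit(df)                       # drops the rows containing NA

print(plain_mean)
print(mean_score)
print(nrow(new_df))
```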
Control Structures in R
As the name suggests, a control structure controls the flow of code / commands written inside a
function. A function is a set of multiple commands written to automate a repetitive coding task.
For example: You have 10 data sets. You want to find the mean of the Age column present in every data
set. This can be done in 2 ways: either you write the code to compute the mean 10 times, or you simply
create a function and pass each data set to it.
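The second approach could look like the sketch below. The helper name mean_age and the two tiny data sets are hypothetical, standing in for the 10 data sets mentioned above:

```r
# Hypothetical helper: returns the mean of the Age column of one data set
mean_age <- function(data) {
  mean(data$Age, na.rm = TRUE)
}

# Two small made-up data sets
data_sets <- list(
  data.frame(Age = c(21, 25, 30)),
  data.frame(Age = c(40, NA, 50))
)

# Apply the same function to every data set instead of repeating the code
ages <- sapply(data_sets, mean_age)
print(ages)
```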
Let's understand the control structures in R with simple examples:
if, else: This structure is used to test a condition. Below is the syntax:
if (<condition>) {
  ## do something
} else {
  ## do something
}
Example
# initialize a variable
N <- 10
# check if this variable * 5 is > 40
if (N * 5 > 40) {
  print("This is easy!")
} else {
  print("It's not easy!")
}
[1] "This is easy!"
for: This structure is used when a loop is to be executed a fixed number of times. It is commonly used
for iterating over the elements of an object (list, vector). Below is the syntax:
for (<search condition>) {
  # do something
}
Example
# initialize a vector
y <- c(99, 45, 34, 65, 76, 23)
# print the first 4 numbers of this vector
for (i in 1:4) {
  print(y[i])
}
[1] 99
[1] 45
[1] 34
[1] 65
while: It begins by testing a condition, and executes only if the condition is found to be true. Once
the loop is executed, the condition is tested again. Hence, it's necessary to alter the condition such
that the loop doesn't run infinitely. Below is the syntax:
# initialize a condition
Age <- 12
# check if age is less than 17
while (Age < 17) {
  print(Age)
  Age <- Age + 1 # incrementing Age eventually makes the condition false, which ends the loop
}
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
There are other control structures as well, but they are less frequently used than the ones explained above. Those
structures are:
1. repeat: It executes an infinite loop
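Because repeat has no condition of its own, it is normally paired with base R's break statement to stop it. A minimal sketch (the counter variable is illustrative):

```r
counter <- 0
repeat {
  counter <- counter + 1
  if (counter >= 3) {
    break  # without break, repeat would never stop
  }
}
print(counter)
```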
Note: If you find the section on control structures difficult to understand, not to worry. R is supported by
various packages that complement the work done by control structures.
Useful R Packages
Out of ~7800 packages listed on CRAN (https://cran.r-project.org/), I've listed some of the most
powerful and commonly used packages in predictive modeling in this article. Since I've already
explained the method of installing packages, you can go ahead and install them now. Sooner or later
you'll need them.
Importing Data: R offers a wide range of packages for importing data available in any format such as
.txt, .csv, .json, .sql etc. To import large files of data quickly, it is advisable to install and use data.table,
But, becomes complex when it comes to creating advanced graphics. Hence, you should install
ggplot2.
Data Manipulation: R has a fantastic collection of packages for data manipulation. These packages
allow you to do basic & advanced computations quickly. These packages are dplyr, plyr, tidyr,
lubridate, stringr. Check out this complete tutorial
(https://www.analyticsvidhya.com/blog/2015/12/faster-data-manipulation-7-packages/)
on data manipulation packages in R.
Modeling / Machine Learning: For modeling, the caret package in R is powerful enough to cater to
every need for creating a machine learning model. However, you can also install packages algorithm-wise,
such as randomForest, rpart, gbm etc.
Note: I've only mentioned the commonly used packages. You might like to check this interesting
infographic
(https://www.analyticsvidhya.com/blog/2015/08/list-r-packages-data-analysis/)
on
Till here, you became familiar with the basic work style in R and its associated components. From the
next section, we'll begin with predictive modeling. But before you proceed, I want you to practice
what you've learnt till here.
Practice Assignment: As a part of this assignment, install the swirl package in R. Then type
library(swirl) to initiate the package, and complete this interactive R tutorial. If you have followed this
article thoroughly, this assignment should be an easy task for you!
this tutorial, I've taken the data set from Big Mart Sales Prediction
(https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/).
Before we
Predictor Variable (a.k.a Independent Variable): In a data set, predictor variables (Xi) are those using
which the prediction is made on the response variable. (Image below).
(https://www.analyticsvidhya.com/wp-content/uploads/2016/02/PRV.png)
Train Data: The predictive model is always built on the train data set. An intuitive way to identify the train
data is that it always has the response variable included.
Test Data: Once the model is built, its accuracy is tested on test data. This data usually contains fewer
observations than the train data set. Also, it does not include the response variable.
Right now, you should download the data set. Take a good look at train and test data. Cross check the
information shared above and then proceed.
unnecessary directory troubles. Once the directory is set, we can easily import the .csv files using the
commands below.
# Load Datasets
train <- read.csv("Train_UWu5bXk.csv")
test <- read.csv("Test_u94Q5KV.csv")
In fact, even prior to loading data in R, it's a good practice to look at the data in Excel. This helps in
strategizing the complete prediction modeling process. To check if the data set has been loaded
successfully, look at the R environment. The data can be seen there. Let's explore the data quickly.
# check dimensions (number of rows & columns) in data set
> dim(train)
[1] 8523   12
> dim(test)
[1] 5681   11
We have 8523 rows and 12 columns in the train data set and 5681 rows and 11 columns in the test data set. This
makes sense. Test data should always have one column less (mentioned above, right?). Let's get
deeper into the train data set now.
# check the variables and their types in train
> str(train)
'data.frame': 8523 obs. of 12 variables:
 $ Item_Identifier          : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
 $ Item_Weight              : num 9.3 5.92 17.5 19.2 8.93 ...
 $ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
 $ Item_Visibility          : num 0.016 0.0193 0.0168 0 0 ...
 $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
 $ Item_MRP                 : num 249.8 48.3 141.6 182.1 53.9 ...
 $ Outlet_Identifier        : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
 $ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
 $ Outlet_Size              : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
 $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
 $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
 $ Item_Outlet_Sales        : num 3735 443 2097 732 995 ...
In the train data set, we have 1463 missing values. Let's check the variables in which these values are
missing. It's important to find and locate these missing values. Many data scientists have repeatedly
advised beginners to pay close attention to missing values in the data exploration stages.
> colSums(is.na(train))
          Item_Identifier               Item_Weight
                        0                      1463
         Item_Fat_Content           Item_Visibility
                        0                         0
                Item_Type                  Item_MRP
                        0                         0
        Outlet_Identifier Outlet_Establishment_Year
                        0                         0
              Outlet_Size      Outlet_Location_Type
                        0                         0
              Outlet_Type         Item_Outlet_Sales
                        0                         0
Hence, we see that column Item_Weight has 1463 missing values. Let's get more inferences from this
data.
> summary(train)
Here are some quick inferences drawn from variables in the train data set:
1. Item_Fat_Content has mis-matched factor levels.
2. Minimum value of Item_Visibility is 0. Practically, this is not possible. If an item occupies shelf space in a
grocery store, it ought to have some visibility. We'll treat all 0's as missing values.
3. Item_Weight has 1463 missing values (already explained above).
4. Outlet_Size has an unnamed (blank) factor level.
These inferences will help us in treating these variables more accurately.
We can see that the majority of sales has been obtained from products having visibility less than 0.2. This
suggests that Item_Visibility < 0.2 must be an important factor in determining sales. Let's plot a few more
interesting graphs and explore such hidden stories.
> ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_bar(stat = "identity", color = "purple") +
theme_bw() + # apply the complete theme first, otherwise it resets the axis text settings below
theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) +
ggtitle("Outlets vs Total Sales")
Here, we infer that OUT027 has contributed to the majority of sales, followed by OUT035. OUT010 and OUT019
have probably the least footfall, thereby contributing to the least outlet sales.
> ggplot(train, aes(Item_Type, Item_Outlet_Sales)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "navy")) +
xlab("Item Type") + ylab("Item Outlet Sales") + ggtitle("Item Type vs Sales")
From this graph, we can infer that Fruits and Vegetables contribute to the highest amount of outlet
sales, followed by snack foods and household products. This information can also be represented
using a box plot chart. The benefit of using a box plot is that you get to see the outliers and the spread
(median and quartiles) of the corresponding levels of a variable (shown below).
> ggplot(train, aes(Item_Type, Item_MRP)) + geom_boxplot() +
theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "red")) +
xlab("Item Type") + ylab("Item MRP") + ggtitle("Item Type vs Item MRP")
The black points you see are outliers. The middle line you see in each box is the median value of that item
type. To know more about boxplots, check this tutorial
(https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/).
Now, we have an idea of the variables and their importance on the response variable. Let's now move
back to where we started: missing values. Now we'll impute the missing values.
We saw that the variable Item_Weight has missing values. Item_Weight is a continuous variable. Hence, in
this case we can impute missing values with the mean / median of Item_Weight. These are the most
commonly used methods of imputing missing values. To explore other methods of this technique,
check out this tutorial (https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/).
Let's first combine the data sets. This will save our time, as we don't need to write separate code
for train and test data sets. To combine the two data frames, we must make sure that they have equal
columns, which is not the case.
> dim(train)
[1] 8523   12
> dim(test)
[1] 5681   11
The test data set has one less column (the response variable). Let's first add the column. We can give this
column any value. An intuitive approach would be to extract the mean value of sales from the train data
set and use it as a placeholder for the test variable Item_Outlet_Sales. Anyways, let's make it simple for
now. I've taken a value 1. Now, we'll combine the data sets.
> test$Item_Outlet_Sales <- 1
> combi <- rbind(train, test)
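The same two steps can be tried on a toy example first. The data frames train_toy and test_toy below are made up; they only mimic the situation where the test set lacks the response column:

```r
train_toy <- data.frame(Item_Weight = c(9.3, NA),
                        Item_Outlet_Sales = c(3735.1, 443.4))
test_toy  <- data.frame(Item_Weight = c(17.5))

# Add a placeholder response so both data frames have the same columns
test_toy$Item_Outlet_Sales <- 1

# rbind() stacks the rows; columns are matched by name
combi_toy <- rbind(train_toy, test_toy)
print(dim(combi_toy))
```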
Impute missing values by the median. I'm using the median because it is known to be highly robust to outliers.
Moreover, for this problem, our evaluation metric is RMSE
(https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-errormetrics/)
which is also highly affected by outliers. Hence, the median is better in this case.
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)
> table(is.na(combi$Item_Weight))
FALSE
14204
Let's take up Item_Visibility. In the graph above, we saw that item visibility has zero values also, which is
practically not feasible. Hence, we'll consider them as missing values and once again make the
imputation using the median.
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0,
median(combi$Item_Visibility), combi$Item_Visibility)
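One detail worth noting: the median in the command above is computed over all values, including the zeros that are being treated as missing. A possible variant, shown here only as a sketch on made-up numbers and not as the article's method, computes the median from the non-zero values:

```r
# Made-up visibility values; the zeros stand for "missing"
visibility <- c(0.016, 0.019, 0, 0.05, 0)

# Variant (an assumption): median of the non-zero values only
med_nonzero <- median(visibility[visibility != 0])

imputed <- ifelse(visibility == 0, med_nonzero, visibility)
print(imputed)
```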
Let's proceed to categorical variables now. During exploration, we saw there are mis-matched levels
in variables which need to be corrected.
> levels(combi$Outlet_Size)[1] <- "Other"
> library(plyr)
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content,
c("LF" = "Low Fat", "reg" = "Regular"))
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c("low fat" = "Low Fat"))
> table(combi$Item_Fat_Content)
Low Fat Regular
   9185    5019
Using the commands above, I've assigned the name Other to the unnamed level in the Outlet_Size variable.
For the rest, I've simply renamed the various levels of Item_Fat_Content.
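If plyr is not available, the same renaming can be sketched with base R by assigning to levels(). This is an alternative technique, not the code used above, and the factor fat is made up to mimic the mis-matched levels:

```r
fat <- factor(c("LF", "Low Fat", "low fat", "reg", "Regular"))

# Assigning the same label to several levels merges them
levels(fat)[levels(fat) %in% c("LF", "low fat")] <- "Low Fat"
levels(fat)[levels(fat) == "reg"] <- "Regular"

print(table(fat))
```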
4. Data Manipulation in R
Let's call it the advanced level of data exploration. In this section we'll practically learn about
feature engineering and other useful aspects.
Feature Engineering: This component separates an intelligent data scientist from a technically
enabled data scientist. You might have access to large machines to run heavy computations and
algorithms, but the power delivered by new features just can't be matched. We create new variables
to extract and provide as much new information to the model as possible, to help it make accurate predictions.
If you have been thinking all this time, great. But now is the time to think deeper. Look at the data set
and ask yourself, what else (factor) could influence Item_Outlet_Sales? Anyhow, the answer is below.
But, I want you to try it out first, before scrolling down.
1. Count of Outlet Identifiers: There are 10 unique outlets in this data. This variable will give us
information on the count of outlets in the data set. The higher the count of an outlet, the higher
the sales contributed by it are likely to be.
> library(dplyr)
> a <- combi %>%
group_by(Outlet_Identifier) %>%
tally()
> head(a)
Source: local data frame [6 x 2]
  Outlet_Identifier     n
             (fctr) (int)
1            OUT010   925
2            OUT013  1553
3            OUT017  1543
4            OUT018  1546
5            OUT019   880
6            OUT027  1559
> names(a)[2] <- "Outlet_Count"
> combi <- full_join(a, combi, by = "Outlet_Identifier")
As you can see, the dplyr package makes data manipulation quite effortless. You no longer need to write
long functions. In the code above, I've simply stored the new data frame in a variable a. Later, the new
column Outlet_Count is added to our original combi data set. To know more about dplyr, follow this
tutorial (https://rpubs.com/bradleyboehmke/data_wrangling).
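For readers who want to see the count-then-join idea without dplyr, here is a base R sketch using table() and merge(). It is an alternative to the code above, and combi_toy is a made-up stand-in for the real data:

```r
combi_toy <- data.frame(
  Outlet_Identifier = c("OUT010", "OUT013", "OUT010", "OUT013", "OUT013"),
  Item_MRP = c(249.8, 48.3, 141.6, 182.1, 53.9)
)

# Count the rows per outlet; responseName sets the name of the count column
counts <- as.data.frame(table(Outlet_Identifier = combi_toy$Outlet_Identifier),
                        responseName = "Outlet_Count")

# Join the counts back so every row carries its outlet's count
combi_toy <- merge(counts, combi_toy, by = "Outlet_Identifier")
print(combi_toy)
```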
2. Count of Item Identifiers: Similarly, we can compute the count of item identifiers too. It's a good
practice to fetch more information from unique ID variables using their count. This will help us to
understand which item has the maximum frequency.
> b <- combi %>%
group_by(Item_Identifier) %>%
tally()
> names(b)[2] <- "Item_Count"
> head(b)
  Item_Identifier Item_Count
           (fctr)      (int)
1           DRA12          9
2           DRA24         10
3           DRA59         10
4DRB018
5DRB139
6DRB248
(https:/
/datahack.analyticsvidhya.com/contest/thestrategic-monk/)
3. Outlet Years: This variable represents how long a particular outlet had been in existence as of the
year 2013. Why just 2013? You'll find the answer in the problem statement here
(http://datahack.analyticsvidhya.com/contest/practice-problem-bigmart-sales-prediction).
My hypothesis is: the older the outlet, the more footfall, the larger the base of loyal customers and the
larger the outlet sales.
> c <- combi %>%
      select(Outlet_Establishment_Year) %>%
      mutate(Outlet_Year = 2013 - combi$Outlet_Establishment_Year)
> head(c)
  Outlet_Establishment_Year Outlet_Year
1                      1999          14
2                      2009           4
3                      1999          14
4                      1998          15
5                      1987          26
6                      2009           4
> combi <- full_join(c, combi)
This suggests that outlets established in 1999 were 14 years old in 2013, and so on.
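One caution about joining on Outlet_Establishment_Year: that column is not unique per row, so a join on it pairs every row with every other row sharing the same year. The sketch below illustrates the effect with base R merge() on a made-up three-row data frame (combi_toy and lookup are hypothetical names), and shows that adding the column in place avoids the problem. It is an illustration of the general behaviour of joins, not a measurement on the BigMart data.

```r
# Toy stand-in for combi: three rows, two of which share an establishment year.
combi_toy <- data.frame(
  Item = c("A", "B", "C"),
  Outlet_Establishment_Year = c(1999, 1999, 2009)
)

# Lookup built the same way as `c` in the text: one row per row of the data.
lookup <- data.frame(
  Outlet_Establishment_Year = combi_toy$Outlet_Establishment_Year,
  Outlet_Year = 2013 - combi_toy$Outlet_Establishment_Year
)

# Joining on a non-unique key pairs every matching row with every other one.
joined <- merge(lookup, combi_toy, by = "Outlet_Establishment_Year")
print(nrow(joined))   # more rows than combi_toy has

# Adding the column in place keeps the row count unchanged.
direct <- combi_toy
direct$Outlet_Year <- 2013 - direct$Outlet_Establishment_Year
print(nrow(direct))
```

If the row count of combi grows after a join, computing the new column directly on combi is the safer route.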
4. Item Type New: Now, pay attention to the Item_Identifiers. We are about to discover a new
trend. Look carefully: there is a pattern in the identifiers, which start with FD, DR or NC. Now, check the
corresponding Item_Types for these identifiers in the data set. You'll discover that items corresponding to
FD are mostly eatables. Items corresponding to DR are drinks. And items corresponding to NC
are products which can't be consumed; let's call them non-consumable. Let's extract these prefixes
into a new variable representing the broad item category.
Here I'll use the substr() and gsub() functions to extract and rename the values respectively.
> q <- substr(combi$Item_Identifier, 1, 2)
> q <- gsub("FD", "Food", q)
> q <- gsub("DR", "Drinks", q)
> q <- gsub("NC", "NonConsumable", q)
> table(q)
       Drinks          Food NonConsumable
         1317         10201          2686
Let's now add this information to our data set with the variable name Item_Type_New.
> combi$Item_Type_New <- q
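As an alternative to chaining several gsub() calls, the same prefix-to-category mapping can be written as a named lookup vector. The sketch below uses made-up identifiers (ids and prefix_map are hypothetical names); it is one possible formulation, not the tutorial's method.

```r
# Hypothetical identifiers following the FD / DR / NC prefix pattern.
ids <- c("FDA15", "DRC01", "NCD19", "FDX07")

# Named vector acting as a lookup table from prefix to category.
prefix_map <- c(FD = "Food", DR = "Drinks", NC = "NonConsumable")

# Take the first two characters and look each one up by name.
item_type_new <- unname(prefix_map[substr(ids, 1, 2)])
print(table(item_type_new))
```

A lookup keeps all mappings in one place, and an unexpected prefix shows up as NA instead of passing through silently.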
I'll leave the rest of the feature engineering intuition to you. You can think of more variables which could
add more information to the model. But make sure the variables aren't correlated. Since they
emanate from the same set of variables, there is a high chance of them being correlated. You can
check this in R using the cor() function.
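A small sketch of why derived variables deserve a correlation check: a feature computed as a linear function of another column is perfectly correlated with it. The six years below are the ones shown in the head(c) output earlier; the variable names are made up for this example.

```r
# Establishment years for six rows (values taken from the output shown earlier).
establishment_year <- c(1999, 2009, 1999, 1998, 1987, 2009)

# A derived feature: outlet age as of 2013.
outlet_age <- 2013 - establishment_year

# The derived feature is an exact linear function of its source,
# so the two are perfectly (negatively) correlated.
r <- cor(establishment_year, outlet_age)
print(r)
```

In a case like this, only one of the two columns should go into the model.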
Label Encoding, in simple words, is the practice of numerically encoding (replacing) the different levels of
a categorical variable. For example: in our data set, the variable Item_Fat_Content has 2 levels: Low
Fat and Regular. So, we'll encode Low Fat as 0 and Regular as 1. This will help us convert a factor
variable into a numeric variable. This can be done simply using the ifelse() function in R.
> combi$Item_Fat_Content <- ifelse(combi$Item_Fat_Content == "Regular", 1, 0)
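Here is a minimal sketch of label encoding on a made-up factor (fat is a hypothetical vector). It shows the ifelse() rule from the text next to a second option based on the factor's integer codes; which one suits a given column is a judgment call.

```r
# Hypothetical fat-content values with two levels.
fat <- factor(c("Low Fat", "Regular", "Regular", "Low Fat"))

# Option 1: explicit rule, as in the tutorial.
encoded_ifelse <- ifelse(fat == "Regular", 1, 0)

# Option 2: use the factor's integer codes (levels sort alphabetically,
# so "Low Fat" is code 1 and "Regular" is code 2), shifted to start at 0.
encoded_codes <- as.integer(fat) - 1L

print(encoded_ifelse)
print(encoded_codes)
```

The explicit rule is easier to read for two levels; integer codes scale to many levels but depend on the level order.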
One Hot Encoding, in simple words, is the splitting of a categorical variable into its unique levels,
and eventually removing the original variable from the data set. Confused? Here's an example: let's take
any categorical variable, say, Outlet_Location_Type. It has 3 levels. One hot encoding of this variable
will create 3 different variables consisting of 1s and 0s. A 1 represents the presence of that level in a
row and a 0 represents its absence. Let's look at a sample:
> sample <- select(combi, Outlet_Location_Type)
> demo_sample <- data.frame(model.matrix(~ . - 1, sample))
> head(demo_sample)
  Outlet_Location_TypeTier.1 Outlet_Location_TypeTier.2 Outlet_Location_TypeTier.3
1                          1                          0                          0
2                          0                          0                          1
3                          1                          0                          0
4                          0                          0                          1
5                          0                          0                          1
6                          0                          0                          1
model.matrix creates a matrix of encoded variables. ~ . - 1 tells R to encode all variables in the data
frame, but suppress the intercept. So, what will happen if you don't write -1? model.matrix will skip
the first level of the factor and emit an intercept column instead, so only 2 of the 3 factor levels get
a column of their own.
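The difference between the two formulas is easy to check on a made-up factor. In this sketch sample_df is a hypothetical data frame; model.matrix() is the same base R function used above.

```r
# Hypothetical sample with a three-level factor.
sample_df <- data.frame(
  Outlet_Location_Type = factor(c("Tier 1", "Tier 3", "Tier 2", "Tier 3"))
)

# Without the intercept: one indicator column per level.
full_dummies <- model.matrix(~ . - 1, data = sample_df)

# With the intercept: the first level ("Tier 1") becomes the baseline
# and is represented by the intercept column instead of its own column.
with_intercept <- model.matrix(~ ., data = sample_df)

print(colnames(full_dummies))
print(colnames(with_intercept))
```

With the intercept kept, a row belonging to the first level is the one where both remaining indicator columns are 0.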
That was a demonstration of one hot encoding. I hope you have understood the concept now. Let's
now apply this technique to all categorical variables in our data set (excluding the ID variables).
> library(dummies)
> combi <- dummy.data.frame(combi, names = c('Outlet_Size', 'Outlet_Location_Type',
      'Outlet_Type', 'Item_Type_New'), sep = '_')
With this, I have shared 2 different methods of performing one hot encoding in R. Let's check if the
encoding has been done.
> str(combi)
 $ Outlet_Size_Other            : int 0 1 1 0 1 0 0 0 0 0 ...
 $ Outlet_Size_High             : int 0 0 0 1 0 0 0 0 0 0 ...
 $ Outlet_Size_Medium           : int 1 0 0 0 0 0 1 1 0 1 ...
 $ Outlet_Size_Small            : int 0 0 0 0 0 1 0 0 1 0 ...
 $ Outlet_Location_Type_Tier1   : int 1 0 0 0 0 0 0 0 1 0 ...
 $ Outlet_Location_Type_Tier2   : int 0 1 0 0 1 1 0 0 0 0 ...
 $ Outlet_Location_Type_Tier3   : int 0 0 1 1 0 0 1 1 0 1 ...
 $ Outlet_Type_GroceryStore     : int 0 0 1 0 0 0 0 0 0 0 ...
 $ Outlet_Type_SupermarketType1 : int 1 1 0 1 1 1 0 0 1 0 ...
 $ Outlet_Type_SupermarketType2 : int 0 0 0 0 0 0 0 1 0 0 ...
 $ Outlet_Type_SupermarketType3 : int 0 0 0 0 0 0 1 0 0 1 ...
 $ Item_Outlet_Sales            : num 1382928425532553...
 $ Year                         : num 14 11 15 26 6 9 28 4 16 28 ...
 $ Item_Type_New_Drinks         : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Item_Type_New_Food           : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Item_Type_New_NonConsumable  : int 0 0 0 0 0 0 0 0 0 0 ...
As you can see, after one hot encoding, the original variables are removed automatically from the
data set.
Finally, we'll drop the columns which have either been converted into other variables or are
identifier variables. This can be accomplished using select from the dplyr package.
> combi <- select(combi, -c(Item_Identifier, Outlet_Identifier, Item_Fat_Content,
      Outlet_Establishment_Year, Item_Type))
> str(combi)
In this section, I'll cover Regression, Decision Trees and Random Forest. A detailed explanation of
these algorithms is outside the scope of this article. They have been satisfactorily
explained in our previous articles; I've provided links to useful resources.
As you can see, we have encoded all our categorical variables. Now, this data set is good to
take forward to modeling. Since we started from train and test, let's now divide the data set again.
> new_train <- combi[1:nrow(train), ]
> new_test <- combi[-(1:nrow(train)), ]
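The positional split relies on negative indexing: combi[-(1:n), ] returns every row except the first n. A minimal sketch on made-up data (train_toy, test_toy and combi_toy are hypothetical names):

```r
# Hypothetical stand-ins for the train and test files.
train_toy <- data.frame(x = 1:3, y = c(10, 20, 30))
test_toy  <- data.frame(x = 4:5, y = c(1, 1))   # placeholder response

combi_toy <- rbind(train_toy, test_toy)

# Positional split: the first nrow(train) rows are train, the rest are test.
n_train <- nrow(train_toy)
new_train_toy <- combi_toy[1:n_train, ]
new_test_toy  <- combi_toy[-(1:n_train), ]

print(dim(new_train_toy))
print(dim(new_test_toy))
```

This only recovers the original split if the rows are still in the order produced by rbind(). Joins may reorder rows, so it is worth confirming the order (or carrying an explicit train/test flag column) before splitting by position.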
Linear (Multiple) Regression
Multiple Regression is used when the response variable is continuous in nature and there are many predictors.
Had it been categorical, we would have used Logistic Regression. Before you proceed, sharpen
your basics of Regression here (https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/).
Let's now build our first regression model on this data set. R uses the lm() function for regression.
> linear_model <- lm(Item_Outlet_Sales ~ ., data = new_train)
> summary(linear_model)
Adjusted R² measures the goodness of fit of a regression model. The higher the R², the better the model.
Our R² = 0.2085. It means we really did something drastically wrong. Let's figure it out.
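For reference, adjusted R² can be recomputed by hand from the plain R² as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below checks this on R's built-in mtcars data, which is used here purely for illustration and has nothing to do with the BigMart data.

```r
# Built-in data set used purely for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)            # number of observations
p <- length(coef(fit)) - 1   # number of predictors (intercept excluded)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(r_squared = s$r.squared, adjusted = s$adj.r.squared, manual = adj_manual))
```

The penalty is why adding uninformative predictors can lower adjusted R² even though plain R² never decreases.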
In our case, I could see that our new variables aren't helping much, i.e. Item_Count, Outlet_Count and
Item_Type_New. None of these variables is significant. Significant variables are denoted by a * sign
in the summary output.
As we know, correlated predictor variables bring down the model accuracy. Let's find out the
amount of correlation present in our predictor variables. This can be calculated simply using:
> cor(new_train)
Alternatively, you can also use the corrplot package for some fancy correlation plots. Scrolling through
the long list of correlation coefficients, I could find a deadly correlation coefficient:
> cor(new_train$Outlet_Count, new_train$`Outlet_Type_GroceryStore`)
[1] -0.9991203
Outlet_Count is highly (negatively) correlated with Outlet Type Grocery Store. Here are some
problems I could find in this model:
1. We have correlated predictor variables.
2. We did one hot encoding and label encoding. That's not necessary, since linear regression handles
categorical variables by creating dummy variables intrinsically.
3. The new variables (item count, outlet count, item type new) created in feature engineering are not
significant.
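Rather than scrolling through a long correlation matrix, the strongly correlated pairs can be filtered out programmatically. The sketch below uses a made-up data frame preds with three hypothetical columns and a threshold of 0.9, which is an arbitrary choice for this example.

```r
# Hypothetical numeric predictors; b is almost an exact negative copy of a.
preds <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(-1.1, -2.0, -2.9, -4.2, -5.0, -6.1),
  c = c(3, 1, 4, 1, 5, 9)
)

cm <- cor(preds)

# Keep each pair once (upper triangle) and flag strong correlations.
strong <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cm)[strong[, "row"]],
  var2 = colnames(cm)[strong[, "col"]],
  r    = cm[strong]
)
print(strong_pairs)
```

Each flagged pair is a candidate for dropping one of its two variables before refitting.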
Let's try to create a more robust regression model. This time, I'll build a simple model
without encoding and new features. Below is the entire code:
# load directory
> path <- "C:/Users/manish/desktop/Data/February2016"
> setwd(path)
# load data
# stringsAsFactors = TRUE matches the pre-4.0 default of read.csv that the levels() calls below rely on
> train <- read.csv("train_Big.csv", stringsAsFactors = TRUE)
> test <- read.csv("test_Big.csv", stringsAsFactors = TRUE)
# create a new variable in test file
> test$Item_Outlet_Sales <- 1
# combine train and test data
> combi <- rbind(train, test)
# impute missing value in Item_Weight
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)
# impute 0 in Item_Visibility
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility),
      combi$Item_Visibility)
# rename level in Outlet_Size
> levels(combi$Outlet_Size)[1] <- "Other"
# rename levels of Item_Fat_Content
> library(plyr)
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c("LF" = "Low Fat", "reg" = "Regular"))
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c("low fat" = "Low Fat"))
# create a new column: 2013 - Year
> combi$Year <- 2013 - combi$Outlet_Establishment_Year
# drop variables not required in modeling
> library(dplyr)
> combi <- select(combi, -c(Item_Identifier, Outlet_Identifier, Outlet_Establishment_Year))
# divide data set
> new_train <- combi[1:nrow(train), ]
> new_test <- combi[-(1:nrow(train)), ]
# linear regression
> linear_model <- lm(Item_Outlet_Sales ~ ., data = new_train)
> summary(linear_model)
Now we have got R² = 0.5623. This teaches us that sometimes all you need is a simple thought process
to get higher accuracy. That is quite a good improvement over the previous model. Next time you work on
any model, always remember to start with a simple model.
Let's check out the regression plots to find more ways to improve this model.
> par(mfrow = c(2, 2))
> plot(linear_model)
You can zoom into these graphs in RStudio at your end. All these plots have a different story to tell. But
the most important story is being portrayed by the Residuals vs Fitted graph.
Residual values are the differences between actual and predicted outcome values. Fitted values are
the predicted values. If you look carefully, you'll discover a funnel-shaped graph (from right to left).
The shape of this graph suggests that our model is suffering from heteroskedasticity (unequal
variance in error terms). Had there been constant variance, there would be no pattern visible in this
graph.
A common practice to tackle heteroskedasticity is to take the log of the response variable. Let's
do it and check if we can get further improvement.
> linear_model <- lm(log(Item_Outlet_Sales) ~ ., data = new_train)
> summary(linear_model)
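To see why the log transform helps, here is a small simulation sketch. The data are simulated with multiplicative noise (an assumption made for this example, not a claim about the BigMart data), and spread_ratio() is a hypothetical helper that compares the residual spread in the upper and lower halves of the fitted values.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated data with multiplicative noise: the spread of y grows with its level.
n <- 1000
x <- runif(n, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.3))

raw_fit <- lm(y ~ x)
log_fit <- lm(log(y) ~ x)

# Ratio of residual spread in the upper half of fitted values to the lower half.
# A ratio near 1 suggests roughly constant variance.
spread_ratio <- function(fit) {
  upper <- fitted(fit) > median(fitted(fit))
  sd(resid(fit)[upper]) / sd(resid(fit)[!upper])
}

print(c(raw = spread_ratio(raw_fit), log = spread_ratio(log_fit)))
```

A ratio well above 1 for the raw model corresponds to the funnel shape in the Residuals vs Fitted plot; after the log transform the ratio should sit close to 1.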
And here's a snapshot of my model output. Congrats! We have got an improved model with R² = 0.72.
Now we are on the right path. Once again, you can check the residual plots (you might need to zoom in). You'll
find there is no longer a trend in the residual vs fitted value plot.
This model can be further improved by detecting outliers and high leverage points. For now, I leave
that part to you! I shall write a separate post on the mysteries of regression soon. For now, let's check our
RMSE so that we can compare it with the other algorithms demonstrated below.
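RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below defines a small helper by hand (rmse_manual is a hypothetical name, and the numbers are made up); it also shows that predictions from a model fitted on log(y) need exp() before they are compared with the original values.

```r
# Root mean squared error: the square root of the average squared difference.
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical actual and predicted sales.
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
print(rmse_manual(actual, predicted))

# When the model was fitted on log(y), predictions are on the log scale
# and need exp() before being compared with the original values.
log_predictions <- log(predicted)
print(rmse_manual(actual, exp(log_predictions)))
```

Comparing RMSE across models is only meaningful when all predictions are on the same scale as the response.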
Let's proceed to the decision tree algorithm and try to improve our RMSE score.
Decision Trees
Before you start, I'd recommend you glance through the basics of decision tree algorithms. To
understand what makes it superior to linear regression, check this tutorial: Part 1
(https://www.analyticsvidhya.com/blog/2015/01/decision-tree-simplified/) and Part 2
(https://www.analyticsvidhya.com/blog/2015/01/decision-tree-algorithms-simplified/).
In R, the decision tree algorithm can be implemented using the rpart package. In addition, we'll use the caret
package for doing cross validation. Cross validation is a technique to build robust models which are
not prone to overfitting. Read more about Cross Validation
(https://www.analyticsvidhya.com/blog/2015/11/improve-model-performance-cross-validation-inpython-r/).
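To make the idea of k-fold cross validation concrete, here is a hand-rolled base R sketch. It uses a linear model on the built-in mtcars data purely for illustration (the tutorial itself lets caret do this for an rpart tree); fold_id and cv_rmse are hypothetical names.

```r
set.seed(1)  # arbitrary seed for reproducible fold assignment

k <- 5
n <- nrow(mtcars)

# Assign every row to one of k folds at random.
fold_id <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other folds, score on the held-out fold.
cv_rmse <- sapply(1:k, function(i) {
  held_out <- fold_id == i
  fit  <- lm(mpg ~ wt + hp, data = mtcars[!held_out, ])
  pred <- predict(fit, newdata = mtcars[held_out, ])
  sqrt(mean((mtcars$mpg[held_out] - pred)^2))
})

print(cv_rmse)
print(mean(cv_rmse))  # cross-validated estimate of out-of-sample RMSE
```

Every row is held out exactly once, so the averaged score reflects performance on data the model did not see during fitting.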
In R, the decision tree uses a complexity parameter (cp). It measures the tradeoff between model
complexity and accuracy on the training set. A smaller cp will lead to a bigger tree, which might overfit
the model. Conversely, a large cp value might underfit the model. Underfitting occurs when the
model does not capture underlying trends properly. Let's find out the optimum cp value for our
model with 5 fold cross validation.
# loading required libraries
> library(rpart)
> library(e1071)
> library(rpart.plot)
> library(caret)
# setting the tree control parameters
> fitControl <- trainControl(method = "cv", number = 5)
> cartGrid <- expand.grid(.cp = (1:50) * 0.01)
# decision tree
> tree_model <- train(Item_Outlet_Sales ~ ., data = new_train, method = "rpart",
      trControl = fitControl, tuneGrid = cartGrid)
> print(tree_model)
The final value for cp = 0.01. You can also check the table populated in the console for more information.
The model with cp = 0.01 has the least RMSE. Let's now build a decision tree with 0.01 as the complexity
parameter.
> main_tree <- rpart(Item_Outlet_Sales ~ ., data = new_train, control = rpart.control(cp = 0.01))
> prp(main_tree)
Here is the tree structure of our model. If you have gone through the basics, you would now
understand that this algorithm has marked Item_MRP as the most important variable (being the root
node). Let's check the RMSE of this model and see if this is any better than regression.
> library(Metrics)  # provides rmse()
> pre_score <- predict(main_tree, type = "vector")
> rmse(new_train$Item_Outlet_Sales, pre_score)
[1] 1102.774
As you can see, our RMSE has further improved from 1140 to 1102.77 with the decision tree. To improve
this score further, you can tune the parameters for greater accuracy.
Random Forest
Random Forest is a powerful algorithm which holistically takes care of missing values, outliers and
other non-linearities in the data set. It's simply a collection of decision trees, hence the name
'forest'. I'd suggest you quickly refresh your basics of random forest with this tutorial
(https://www.analyticsvidhya.com/blog/2015/09/random-forest-algorithm-multiple-challenges/).
In R, the random forest algorithm can be implemented using the randomForest package. Again, we'll use the
caret package (its train() function) for cross validation and finding the optimum value of model parameters.
For this problem, I'll focus on two parameters of random forest: mtry and ntree. ntree is the number
of trees to be grown in the forest. mtry is the number of variables taken at each node to build a tree.
And, we'll do a 5 fold cross validation.
Let's do it!
# load randomForest library
> library(randomForest)
# set tuning parameters
> control <- trainControl(method = "cv", number = 5)
# random forest model
> rf_model <- train(Item_Outlet_Sales ~ ., data = new_train, method = "parRF",
      trControl = control, prox = TRUE, allowParallel = TRUE)
# check optimal parameters
> print(rf_model)
If you notice, you'll see I've used method = "parRF". This is a parallel implementation of random
forest, which makes your local machine take less time on the random forest computation.
Alternatively, you can also use method = "rf" as a standard random forest function.
Now we've got the optimal value of mtry = 15. Let's use 1000 trees for computation.
# random forest model
> forest_model <- randomForest(Item_Outlet_Sales ~ ., data = new_train, mtry = 15, ntree = 1000)
> print(forest_model)
> varImpPlot(forest_model)
This model gives RMSE = 1132.04, which is not an improvement over the decision tree model. Random
forest has a feature of presenting the important variables, and the plot produced by varImpPlot()
shows which variable is the most important.
[Figure: variable importance plot from varImpPlot(forest_model)]
This model can be further improved by tuning parameters. Also, let's make our first submission with
our best RMSE score, obtained by the decision tree.
> main_predict <- predict(main_tree, newdata = new_test, type = "vector")
> sub_file <- data.frame(Item_Identifier = test$Item_Identifier, Outlet_Identifier =
      test$Outlet_Identifier, Item_Outlet_Sales = main_predict)
> write.csv(sub_file, 'Decision_tree_sales.csv')
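A note on the output file: by default write.csv() also writes the row names as an extra, unnamed first column. Whether the contest's submission checker accepts that extra column is not stated here, so the sketch below shows how to leave it out with row.names = FALSE. The data frame sub_toy and its values are hypothetical, and a temporary file is used instead of a real submission path.

```r
# Hypothetical two-row submission frame.
sub_toy <- data.frame(
  Item_Identifier   = c("FDW58", "FDW14"),
  Outlet_Identifier = c("OUT049", "OUT017"),
  Item_Outlet_Sales = c(1500.5, 1400.25)
)

out_path <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the file to exactly the three named columns.
write.csv(sub_toy, out_path, row.names = FALSE)

check <- read.csv(out_path)
print(names(check))
```

Reading the file back, as done with check above, is a quick way to confirm the column layout before uploading.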
When predicted on out-of-sample data, our RMSE has come out to be 1174.33. Here are some things
you can do to improve this model further:
1. Since we did not use encoding, I encourage you to use one hot encoding and label encoding for
Do implement the ideas suggested above and share your improvement in the comments section
below. Currently, Rank 1 on the Leaderboard (http://datahack.analyticsvidhya.com/contest/practiceproblem-big-mart-sales-iii/lb) has obtained an RMSE score of 1137.71. Beat it!
End Notes
This brings us to the end of this tutorial. Regrets for the not-so-happy ending. But I've given you enough
hints to work on. The decision not to use encoded variables in the model turned out to be beneficial
up to decision trees.
The motive of this tutorial was to get you started with predictive modeling in R. We learnt a few
uncanny things, such as: build simple models first. Don't jump towards building a complex model. Simple
models give you a benchmark score and a threshold to work with.
In this tutorial, I have demonstrated the steps used in predictive modeling in R. I've covered data
exploration, data visualization, data manipulation and building models using Regression, Decision
Trees and Random Forest algorithms.
Did you find this tutorial useful? Are you facing any trouble at any stage of this tutorial? Feel free to
mention your doubts in the comments section below. Do share if you get a better score.
Edit: On visitors' request, the PDF version of the tutorial is available for download. You need to create
a login account to download the PDF. Also, you can bookmark this page for future reference.
Download here (http://discuss.analyticsvidhya.com/t/download-free-tutorial-to-learn-data-science-in-r-from-scratch/7797/2).
Want to apply your analytical skills and test your potential? Then participate in our Hackathons
(http://datahack.analyticsvidhya.com/contest/all) and compete with top data
scientists from all over the world.
Author: Analytics Vidhya Content team
105 COMMENTS
Steve (http://www.bigewisdom.net/) says:
February 29, 2016 at 3:46 AM

Analytics Vidhya Content Team says:
February 29, 2016 at 6:13 AM
Welcome Steve. I can make that available. I'll email it to you shortly.

Abhijit says:
February 29, 2016 at 7:45 AM
Please make it (PDF version) available for all the users as well. It will help a lot in a nutshell.

Hemant says:
February 29, 2016 at 11:07 AM
Manish, nice content for beginners. Thanks! I also want this content in PDF format. Please mail this
content in PDF format to me also.

Analytics Vidhya Content Team says:
Hi Hemant
PDF is available for download. Link is added in the tutorial at the end.

Hemant says:
March 6, 2016 at 1:48 PM

Hi Hemant
Link is working fine. You need to create a one time user login to download the PDF.

Hi Manish,
We are looking for R language experts with good understanding on Data Science. Required an
expert to write a book on R language using Data Science. Interested writers/experts please
contact with latest profile at alpinessolutions at gmail dot com.
midhun1992 says:
Sir, I couldn't find the datasets mentioned in the article. Can you please guide me where I can get
the data sets. Thanks.

Elan says:
May 19, 2016 at 8:59 AM

Analytics Vidhya Content Team says:
June 10, 2016 at 11:07 PM
Hi Elan
Please download the data from here: http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii

bgreddy says:
September 16, 2016 at 6:18 PM

Dr.D.K.Samuel says:
February 29, 2016 at 4:09 AM

Analytics Vidhya Content Team says:
Welcome Samuel!
Himanshu
Dhingra (http://www.gutargoo.com) says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106346#RESPOND)
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106384#RESPOND)
FEBRUARY 29, 2016 AT 3:21 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106384)
(https://datahack.analyticsvidhya.com/contest/skilltestThanks Himanshu ! PDF is available for download. Link is added at the end of tutorial.
machine-learning/)
Krishna
REPLYsays:
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106353#RESPOND)
FEBRUARY 29, 2016 AT 8:25 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106353)
Devendra
Yadav says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106359#RESPOND)
FEBRUARY 29, 2016 AT 9:26 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106359)
Hi Manish
Could you please share the pdf with me as well. I am a starter in R and this can help as a compact
guide for myself when trying out di erent things.
Thanks
https://www.analyticsvidhya.com/blog/2016/02/completetutoriallearndatasciencescratch/#two
49/84
20/11/2016
ACompleteTutorialtolearnDataScienceinRfromScratch
Rad REPLY
Mou(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106360#RESPOND)
says:
FEBRUARY 29, 2016 AT 9:33 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106360)
Hello, when I type log(12) I get 2.484907 as a result. What seems to be the problem ?
(https://datahack.analyticsvidhya.com/contest/thestrategic-monk/)
Ram REPLY
says:(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106391#RESPOND)
FEBRUARY 29, 2016 AT 4:38 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106391)
@RadMou,
It seems that there is a typo in the article. The fact is: log uses base e, log10 uses base 10 and
log2 uses base 2.
You can see that these commands print different values:
log(12) # log to the base e
log10(12) # log to the base 10
log2(12) # log to the base 2
Hope this helps.
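For reference, the three functions mentioned in this reply can be compared side by side. This is only a minimal sketch in base R; the variable names are illustrative and not taken from the tutorial.

```r
# Natural, base-10 and base-2 logarithms of the same number
x <- 12
natural <- log(x)     # base e
common  <- log10(x)   # base 10
binary  <- log2(x)    # base 2

# log() also accepts an explicit base, which gives the same answers
same_common <- log(x, base = 10)
same_binary <- log(x, base = 2)

print(c(natural = natural, common = common, binary = binary))
```

So the 2.484907 reported in the question is the natural logarithm of 12, which is what log() returns by default.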
Zamin
Sherazi
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106361#RESPOND)
FEBRUARY 29, 2016 AT 9:58 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106361)
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106383#RESPOND)
FEBRUARY 29, 2016 AT 3:19 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106383)
Hi Zamin
PDF is available for download.
Monil
Doshi
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106364#RESPOND)
FEBRUARY 29, 2016 AT 10:23 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106364)
Hi Manish,
This is very helpful for beginners like me.
Looking forward to more.
Is there any way I can get this in PDF format?
It would be really helpful.
My email id is [email protected] (mailto:[email protected]).
Thank you very much!
Aanish
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106367#RESPOND)
FEBRUARY 29, 2016 AT 11:19 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106367)
Thanks Manish. This is a great help! I have a question: I noticed that R automatically takes care
of factor variables (by converting them to n or n-1 dummy variables) while performing linear
regression. Do you recommend that we do it explicitly?
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106379#RESPOND)
FEBRUARY 29, 2016 AT 3:01 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106379)
Hi Anish
In the case of linear regression, decision trees, random forest and kNN, it is not necessary to convert
categorical variables explicitly, as these algorithms intrinsically break a categorical variable into
n - 1 levels. However, if you are using boosting algorithms (GBM, XGBoost), it is recommended to
encode categorical variables prior to modeling. On a similar note, if you have followed this tutorial,
you'll find that I started with one hot encoding and got a terrible regression accuracy. Later, I used
the categorical variables as they were, and accuracy improved.
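To see what lm() does with a factor, one can inspect the design matrix it builds. The sketch below uses a small made-up data frame (the column names sales and size are hypothetical, not the tutorial's variables) and only base R.

```r
# Toy data: a three-level factor and a numeric response (hypothetical values)
toy <- data.frame(
  sales = c(10, 12, 20, 22, 30, 31),
  size  = factor(c("Small", "Small", "Medium", "Medium", "High", "High"))
)

# lm() expands the factor internally; no manual encoding is needed
fit <- lm(sales ~ size, data = toy)

# model.matrix() shows the design matrix lm() works with:
# an intercept plus n - 1 dummy columns for the n = 3 levels
design <- model.matrix(sales ~ size, data = toy)
print(colnames(design))
```

The first level in alphabetical order ("High" here) becomes the reference level, which is why it has no column of its own.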
kishor
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106368#RESPOND)
FEBRUARY 29, 2016 AT 11:55 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106368)
chandrakala
(http://-) says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106369#RESPOND)
FEBRUARY 29, 2016 AT 12:15 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106369)
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106381#RESPOND)
FEBRUARY 29, 2016 AT 3:17 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106381)
Welcome !
Raman
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106372#RESPOND)
FEBRUARY 29, 2016 AT 1:26 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106372)
Manish,
Very valuable tutorial. TY. If it is not too much trouble, can you please make a PDF version available as a
link in the tutorial? Thanks.
Regards
Raman
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106380#RESPOND)
FEBRUARY 29, 2016 AT 3:17 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106380)
Hi Raman
I've added the PDF link at the end of this tutorial.
Atul Khairnar says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106373#RESPOND)
Thanks for sharing this article. It is really helpful to us. When I ran these scripts in RStudio, I got two
errors. The first was for ggplot, after I tried install.packages("ggplot2") and
install.packages("ggplot2", dependencies = TRUE):
> ggplot(train, aes(x = Item_Visibility, y = Item_Outlet_Sales)) + geom_point(size = 2.5, color = "navy") +
xlab("Item Visibility") + ylab("Item Outlet Sales")
Error: could not find function "ggplot"
The second was for merging the data:
> combi <- merge(b, combi, by = "Outlet_Identifier")
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
Can you help me understand why this happens?
Once again, thank you so much, because I have learned new things about R.
Thanks,
Atul
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106377#RESPOND)
FEBRUARY 29, 2016 AT 2:42 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106377)
Hi Atul
After installing the ggplot2 package, you should load the package in the next step using
library(ggplot2).
Then run the ggplot code; it should work.
The merge function is used from the plyr package. Have you installed it? Let me know.
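A note on the merge error: merge() is also available in base R, so the error can be reproduced without any extra package. The sketch below uses made-up stand-ins for b and combi (the values are hypothetical) and shows that the message appears when the by column is missing from one of the two data frames.

```r
# Toy stand-ins for the tutorial's `b` and `combi` (hypothetical values)
b <- data.frame(
  Item_Identifier = c("DRA12", "DRA24"),
  Item_Count      = c(9L, 10L)
)
combi <- data.frame(
  Item_Identifier   = c("DRA12", "DRA24", "DRA12"),
  Outlet_Identifier = c("OUT027", "OUT035", "OUT035")
)

# `b` has no Outlet_Identifier column, so merge() stops with an error;
# tryCatch() captures the message instead of halting the script
err <- tryCatch(
  merge(b, combi, by = "Outlet_Identifier"),
  error = function(e) conditionMessage(e)
)
print(err)

# A key that exists in both data frames merges without trouble
merged <- merge(b, combi, by = "Item_Identifier")
print(merged)
```

Running names(b) and names(combi) before the merge is a quick way to confirm that the key is present on both sides.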
Atul Khairnar
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106421#RESPOND)
MARCH 1, 2016 AT 6:39 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106421)
Thanks Manish. I tried installing it both manually and through code, but it still shows the following error:
install.packages("plyr")
library(plyr)
combi library(plyr)
Warning message:
package 'plyr' was built under R version 3.1.3
> combi <- merge(b, combi, by = "Outlet_Identifier") ##########Error showing####
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
Can you please help me with this? Why is this error showing?
Arfath
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106911#RESPOND)
MARCH 9, 2016 AT 8:01 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106911)
shashi
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106378#RESPOND)
FEBRUARY 29, 2016 AT 2:48 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106378)
mouradelghissassi1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106388#RESPOND)
FEBRUARY 29, 2016 AT 3:37 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106388)
And not
OUT027 2215.876
OUT035 1463.705
So the command
combi <- merge(b, combi, by = "Outlet_Identifier") should be
combi <- merge(b, combi, by = "Item_Identifier") instead
Also, in head(c) there is a problem with the years: all rows are for 1985.
mouradelghissassi1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106392#RESPOND)
FEBRUARY 29, 2016 AT 4:40 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106392)
"Hence, we see that column Item_Visibility has 1463 missing values. Let's get more inferences
from this data." It's the Item_Weight variable that has missing values.
Also, in Label Encoding and One Hot Encoding: "the variable Item_Visibility has 2 levels: Low Fat
and Regular".
It's Item_Fat_Content, not Item_Visibility.
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106410#RESPOND)
MARCH 1, 2016 AT 4:02 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106410)
Hi
Thank you so much! Editing error. Rectified now.
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106411#RESPOND)
MARCH 1, 2016 AT 4:22 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106411)
Hi
Thanks for pointing out. Made the changes.
In head(c), I wanted to show that, using the mutate command, the count value of each year gets
automatically aligned to its particular year value. Hence, I sorted it. For example, the year 1985
would get 25 as its count value at all the places in the count column. Anyway, I've put a better picture of
the year count now.
Hope this helps.
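The idea of a per-year count repeated on every row can be sketched without dplyr. The example below swaps in base R's ave() in place of the tutorial's mutate step; the data frame and the Outlet_Count column name are made up for illustration.

```r
# Toy outlet data (hypothetical values)
outlets <- data.frame(
  Outlet_Establishment_Year = c(1985, 1999, 1985, 2004, 1985, 1999)
)

# ave() applies length() within each year and returns one value per row,
# so every row of a given year carries that year's count
outlets$Outlet_Count <- ave(
  outlets$Outlet_Establishment_Year,
  outlets$Outlet_Establishment_Year,
  FUN = length
)

# Sorting by year makes the repeated counts easy to read
print(outlets[order(outlets$Outlet_Establishment_Year), ])
```

After sorting, the first rows all belong to the earliest year, which is why a head() of such a table can show a single year only.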
Balaji
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106434#RESPOND)
MARCH 1, 2016 AT 11:42 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106434)
Hi Manish,
I am unable to download the PDF as I get a blank page. Kindly check.
balajimadhav
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106435#RESPOND)
MARCH 1, 2016 AT 11:49 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106435)
Ambuj
Sharma
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106439#RESPOND)
MARCH 1, 2016 AT 12:29 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106439)
Hi,
When I use full_join for Outlet Years, my row count increases to 23590924. I do not understand why
full_join is used and why the row count is increasing.
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106442#RESPOND)
MARCH 1, 2016 AT 1:09 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106442)
Hi Ambuj
The full_join function returns all rows and all columns from the chosen data sets. And, if a value is not
present, it simply returns NA. In your case, you might not have specified the by parameter in
full_join.
ginisk
sam
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106546#RESPOND)
MARCH 4, 2016 AT 1:42 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106546)
ginisk
sam
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106554#RESPOND)
MARCH 4, 2016 AT 6:25 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106554)
Dear Ambuj,
After generating c, I created d using distinct:
d <- c %>%
group_by(Outlet_Establishment_Year) %>%
distinct()
Then merge d with combi as follows:
combi <- merge(d, combi, by = "Outlet_Establishment_Year")
Then it is ready for encoding.
Thanks
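One likely reason for the row count explosion discussed in this thread is a join on a key that repeats in both tables. The sketch below illustrates that effect and the de-duplication fix; it is an assumption-laden toy in base R (merge() and unique() stand in for full_join and distinct, and c_tbl, d and the values are hypothetical).

```r
# Toy data (hypothetical): 4 items spread over 2 establishment years
combi <- data.frame(
  Item = c("A", "B", "C", "D"),
  Outlet_Establishment_Year = c(1985, 1985, 1999, 1985)
)

# A count table that still has one row per item, so years repeat
c_tbl <- data.frame(
  Outlet_Establishment_Year = combi$Outlet_Establishment_Year,
  Outlet_Count = c(3, 3, 1, 3)
)

# Joining on a key that repeats on both sides multiplies rows: 3*3 + 1*1 = 10
exploded <- merge(c_tbl, combi, by = "Outlet_Establishment_Year")

# Reducing the lookup table to one row per year keeps the original 4 rows
d <- unique(c_tbl)
joined <- merge(d, combi, by = "Outlet_Establishment_Year")

print(c(exploded = nrow(exploded), joined = nrow(joined)))
```

Comparing nrow() before and after a join is a cheap check that the key is unique on at least one side.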
Ambuj
Sharma
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106843#RESPOND)
MARCH 8, 2016 AT 5:51 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106843)
Thanks!
gaurav
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106463#RESPOND)
MARCH 2, 2016 AT 5:01 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106463)
Hi ,
Can you please send me the PDF file at [email protected]
(mailto:[email protected]), as I am unable to download the file from the link provided?
Thanks in advance
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106476#RESPOND)
MARCH 2, 2016 AT 9:08 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106476)
Hi Gaurav,
As mentioned, you need to create a one-time user account to download the PDF. You can find the
link in the End Notes.
Jhanak
Sharma1
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106708#RESPOND)
MARCH 6, 2016 AT 11:47 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106708)
Problem no.1 :
When I execute head(b) I get :
  Item_Identifier Item_Count
           (fctr)      (int)
1           DRA12          9
2           DRA24         10
And not
OUT027 2215.876
OUT035 1463.705
I tried the command below but got an error again:
> combi <- merge(b, combi, by = "Outlet_Identifier")
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
Problem No.2 :
When I execute table(q)
I get:
Drinks Food Non-Consumable
2180488 16949063 4461373
and not
Drinks Food Non-Consumable
1317 10201 2686
Problem No.3 :
combi <- dummy.data.frame(combi, names =
+ c('Outlet_Size','Outlet_Location_Type','Outlet_Type', 'Item_Type_New'), sep='_')
Error: cannot allocate vector of size 256.0 Mb
In addition: Warning messages:
1: In anyDuplicated.default(row.names) :
Reached total allocation of 3947Mb: see help(memory.size)
2: In anyDuplicated.default(row.names) :
Reached total allocation of 3947Mb: see help(memory.size)
Q. How to deal with Error: "cannot allocate vector of size"?
Please help me with solutions to the problems stated above.
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106848#RESPOND)
MARCH 8, 2016 AT 7:04 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106848)
Hi Jhanak
Thank you so much for pointing this out.
Answer 1: The code is correct. The output I showed needed an update. It's done now. You can check.
Answer 2: I'll need your code to answer this, because I've checked again on my side and the output of
table(q) is
Drinks Food Non-Consumable
1317 10201 2686
Answer 3: It looks like your Problem 2 and Problem 3 are related. After you combine the data sets,
check the dimensions of the combi data set. It should have 14204 rows and 12 columns. It looks like your
combi data set has too many observations. Usually, memory management issues are solved in
two ways: first, by upgrading machine specifications; second, by using a sparse matrix for
computation. Also, while using R for computation, it is advisable to close other programs
which are not necessary, especially Chrome tabs. This will allow R to compute faster.
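The two checks suggested in Answer 3 might look roughly like the sketch below. It assumes the Matrix package is installed (it ships with most R installations) and uses a made-up combi_toy data frame, since the real combi is not reproduced here; sparse.model.matrix() is swapped in for the tutorial's dummy.data.frame() purely for illustration.

```r
library(Matrix)  # assumption: the Matrix package is installed

# Toy stand-in for combi (hypothetical values)
combi_toy <- data.frame(
  Outlet_Size = factor(c("Small", "Medium", "High", "Small")),
  Outlet_Type = factor(c("Grocery", "Supermarket", "Grocery", "Supermarket"))
)

# Check the dimensions first: an unexpectedly large row count usually
# means an earlier join duplicated rows
print(dim(combi_toy))

# Dummy encoding stored sparsely: only the non-zero cells are kept in memory
sparse_design <- sparse.model.matrix(
  ~ Outlet_Size + Outlet_Type - 1,
  data = combi_toy
)
print(dim(sparse_design))
```

On the real data, fixing the row count is the first thing to try; a sparse representation mainly helps once the number of dummy columns becomes large.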
midhun1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107334#RESPOND)
MARCH 14, 2016 AT 9:13 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107334)
Hi Janak, the dataset is not available now. It seems you have worked on the dataset. Can you
please share the dataset to [email protected] (mailto:[email protected]) It would be
of great help. Thanks.
Venu says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106698#RESPOND)
MARCH 6, 2016 AT 10:43 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106698)
Could you please share the data (./Data/BigMartSales) that you have used here so that we can
play with it?
Venu says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106699#RESPOND)
MARCH 6, 2016 AT 10:48 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106699)
It seems that your PDF file is missing at the link provided. May I request you to update it? Thanks in
advance.
Fred says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106702#RESPOND)
MARCH 6, 2016 AT 11:00 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106702)
buvana
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106791#RESPOND)
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106849#RESPOND)
MARCH 8, 2016 AT 7:16 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106849)
Hi Buvana
Answer a) Do you write code directly in the console? Use RStudio. You should use R scripts, as they
can be saved in .R format and help you retrieve your code at a later time. For more information,
check the first section of this tutorial.
Answer b) full_join is used when we wish to combine two data sets while keeping all rows from both. It returns NA when no matching
values are found. merge is used when we wish to combine two data sets based on a common column.
In full_join, you don't need to specify the by parameter.
Answer c) Thanks for pointing this out. Sorted now.
Guilherme
Cadori says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106897#RESPOND)
MARCH 9, 2016 AT 4:00 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106897)
Hi,
In the Random Forest section, could you please explain why you used ntree = 1000 after finding
mtry = 15?
Cheers,
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106982#RESPOND)
MARCH 10, 2016 AT 7:07 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106982)
Hi Guilherme
If you carefully check the random forest section, I've initially done cross validation using the caret
package. Cross validation provided the optimal value of mtry and ntree at which the RMSE is least
(check output). I then used those parameters in the final random forest model. Another method to
choose mtry and ntree is trial and error, which is certainly time consuming and inconsistent. You may
try this experiment at your end, and let me know if you obtain a lower RMSE than what I've got.
Guilherme
Cadori says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107004#RESPOND)
MARCH 10, 2016 AT 12:46 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107004)
Hi Manish,
Thank you for your attention. I understood how you got mtry. However, in the output printed in this
tutorial, there's no value regarding ntree (e.g. ntree=1000, which was the value you used later on).
How did you get it?
Thanks,
Arfath
says:
REPLY
(HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106925#RESPOND)
MARCH 9, 2016 AT 12:58 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106925)
Thank you very much for this wonderful and unique post. I came to this site to participate in
your data competition. I was puzzled looking at the data sets (train, test and sample), and I didn't
have any idea what to do or how to solve this. Later on I came across this post (thank God I did), and
after going through it I gained confidence and got a clear picture of how to handle
these competitions. Once again, thanks from the bottom of my heart. Since I am completely new to this, I
have a few doubts:
1) In linear_model <- lm(Item_Outlet_Sales ~ ., data = new_train), what does a tilde (~) followed by a
dot (.) mean?
2) What is the best RMSE score for any model?
3) Both the train and test data sets are the same; the only difference is that the test data doesn't have the response variable.
But if we know the response variable values from the train data set, why are we
calculating them for the test data set? Is it because we want to construct a model that predicts
future outcomes, but we want to test how well our model predicts values, so we took a
sample from the main data set and cross-check our predicted values against those of the main data set?
Correct me if my understanding is wrong.
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106986#RESPOND)
MARCH 10, 2016 AT 8:09 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106986)
Hi Arfath
Good to know that you have started learning.
Answer 1: A tilde (~) followed by a dot (.) tells the model to use all the other variables as predictors at once. Otherwise, it
would be very inconvenient to write the names of all the variables one by one. Imagine the time that
would be wasted if you had 200 variables to write. Therefore, use this shorthand: a tilde (~)
followed by a dot (.).
Answer 2: Ideally, every model strives to achieve an RMSE as close to zero as possible, because zero
means your model has predicted the outcome exactly. But that's not possible, since every
model has some irreducible error which affects the accuracy. Hence, the best RMSE score is the
lowest score you can get.
Answer 3: You are absolutely right. The train data set has the response variable, and a model is trained on that.
This model may give you a fantastic RMSE score. But it is worthless unless it predicts with the same
accuracy on out-of-sample data. The ultimate aim of this model is to make future predictions.
Right? Hence, test data is used to check the out-of-sample accuracy of the model. If the accuracy is
not as good as what you achieved on the train data set, it suggests that overfitting has taken place.
I would recommend you to read Introduction to Statistical Learning. Download link is available in
my previous article: http://www.analyticsvidhya.com/blog/2016/02/free-read-books-statistics-mathematics-data-science/
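The three answers above can be tried out on a small scale. The sketch below is only an illustration under stated assumptions: the built-in mtcars data stands in for the tutorial's data, the 22/10 split is arbitrary, and rmse is a helper defined here, not a function from the tutorial.

```r
# Built-in mtcars data stands in for the tutorial's data (assumption)
set.seed(1)
idx   <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# "mpg ~ ." means: model mpg using every other column as a predictor
fit_all <- lm(mpg ~ ., data = train)

# Root mean squared error: 0 would mean perfect predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Error on the data the model has seen versus data it has not seen
rmse_train <- rmse(train$mpg, predict(fit_all, newdata = train))
rmse_test  <- rmse(test$mpg,  predict(fit_all, newdata = test))
print(c(train = rmse_train, test = rmse_test))
```

A test RMSE that is much larger than the train RMSE is the pattern Answer 3 describes as overfitting.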
vijaypk10
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106936#RESPOND)
MARCH 9, 2016 AT 5:43 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106936)
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106981#RESPOND)
MARCH 10, 2016 AT 6:58 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106981)
Hi Vijay
Link is available in the tutorial.
Idea4Life
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107008#RESPOND)
MARCH 10, 2016 AT 1:59 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107008)
Sorry Manish. The link I believe you are mentioning is "Big Mart Sales Prediction". But when I go
into it, it says "The dataset is accessible only if the contest is active". Can you please check and
clarify?
Thanks,
Vijay
VK says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107105#RESPOND)
MARCH 11, 2016 AT 7:04 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107105)
Sorry Manish. I tried from the link "Big Mart Sales Prediction" in the document. But when I go to the
link "Data Set", it shows the following message:
"The dataset is accessible only if the contest is active."
Can you please validate again?
Thanks.
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107145#RESPOND)
MARCH 12, 2016 AT 5:10 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107145)
Hi Vijay
The contest will get active again from tomorrow (13th March 2016).
Regret the inconvenience caused.
Alfa says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=106993#RESPOND)
MARCH 10, 2016 AT 9:52 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-106993)
Thanks for sharing.
I just cannot understand what One Hot Encoding means and how to use it, because I am new
here.
Thanks!
Analytics
Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107148#RESPOND)
MARCH 12, 2016 AT 5:18 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107148)
Hi Alfa
One Hot Encoding is nothing but splitting the levels of a categorical variable into new variables.
The new variables will be encoded with 0s and 1s. 1s represent the presence of information. 0s
represent the absence of information.
For example: Suppose we have a variable named Hair Color. It has 3 levels, namely Red Hair,
Black Hair and Brown Hair. Doing one hot encoding of this variable will result in 3 different variables,
namely Red Hair, Black Hair and Brown Hair. And the original variable Hair Color will be removed from
the data set.
If someone has Red Hair, Red Hair variable will be 1, Black Hair will be 0, Brown Hair will be 0.
If someone has Black Hair, Red Hair variable will be 0, Black Hair will be 1, Brown Hair will be 0.
If someone has Brown Hair, Red Hair variable will be 0, Black Hair will be 0, Brown Hair will be 1.
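The Hair Color example can be reproduced in a few lines. This is a minimal sketch using base R's model.matrix() in place of the tutorial's encoding function, with a made-up people data frame.

```r
# Toy data for the Hair Color example (hypothetical rows)
people <- data.frame(
  Hair_Color = factor(c("Red", "Black", "Brown", "Red"))
)

# "- 1" drops the intercept so every level gets its own 0/1 column
one_hot <- model.matrix(~ Hair_Color - 1, data = people)
print(one_hot)
```

Each row has exactly one 1, in the column that matches that person's hair color, and 0 in the other two columns.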
Prateek says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107263#RESPOND)
MARCH 13, 2016 AT 2:10 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107263)
Hi Manish,
Is it advisable to use one hot encoding when there is a huge number of levels in a categorical
variable?
midhun1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107002#RESPOND)
MARCH 10, 2016 AT 11:48 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107002)
Can someone please mail me the data sets we need for this article to [email protected]
Analytics Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107147#RESPOND)
MARCH 12, 2016 AT 5:12 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107147)
Hi Midhun
The data set will be available for download from tomorrow onwards (13th March 2016)
Regret the inconvenience caused.
midhun1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107323#RESPOND)
MARCH 14, 2016 AT 7:27 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107323)
Hi Manish, sorry to bother you, but it seems the data set is still unavailable. If it's not too much trouble, can you please mail the data to [email protected]
manojlakki7
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107069#RESPOND)
MARCH 11, 2016 AT 4:18 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107069)
Hi Manish,
It's a great article and gives a good start for a beginner like me. Can you please share the data? I can't download it from the link as the contest is not active.
Thank you
Analytics Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107146#RESPOND)
MARCH 12, 2016 AT 5:11 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107146)
Hi Manoj
The data set will be available for download from tomorrow onwards. (13th March 2016)
Roy Basan says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107491#RESPOND)
Good day. When I try to install library(swirl) in the RStudio console, it says it is not found for R version 3.2.4. I get an error which states:
Warning in install.packages :
package library(swirl) is not available (for R version 3.2.4)
Can somebody explain this peculiarity and how I can sort it out?
Thanks
Analytics Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107544#RESPOND)
Hi Roy
First, install the swirl package, then load it using the library function. Note that the package name must be quoted in install.packages. Use the commands below.
> install.packages("swirl")
> library(swirl)
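As a small sketch of the same idea, the install step can be wrapped so that it only runs when the package is missing. The helper name `ensure_package` is hypothetical (it is not part of R or swirl); `requireNamespace()` and `install.packages()` are the standard base R functions.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # the package name is passed as a quoted string
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call downloads nothing.
ensure_package("stats")

# Uncomment to install (if needed) and load swirl from CRAN:
# ensure_package("swirl")
# library(swirl)
```

The error in the question above comes from passing library(swirl) itself to install.packages instead of the quoted name "swirl".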
midhun1992
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107564#RESPOND)
MARCH 17, 2016 AT 5:36 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107564)
victoronclinx
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107616#RESPOND)
MARCH 17, 2016 AT 9:22 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107616)
Analytics Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107636#RESPOND)
MARCH 18, 2016 AT 5:37 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107636)
Hello
There were some technical updates going on at the server. Things are fine now. You may try again.
We regret the inconvenience caused.
Sourabh1987
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107799#RESPOND)
MARCH 19, 2016 AT 7:56 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107799)
I am trying feature engineering on Outlet_Establishment_Year, but the code for merging creates a lot of rows. I tried both merge and full_join.
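One common reason a merge produces far more rows than expected is that the lookup table has more than one row per key. The sketch below illustrates this in base R under that assumption; the data values and the names `year_count` and `bad_lookup` are made up for illustration and are not the tutorial's actual objects.

```r
# Hypothetical stand-in for the combined data set.
combi <- data.frame(
  Outlet_Establishment_Year = c(1985, 1985, 1999, 1999, 1999, 2004),
  Item_MRP = c(50, 60, 70, 80, 90, 100)
)

# Lookup table with exactly one row per year: the number of rows for that year.
year_count <- as.data.frame(table(combi$Outlet_Establishment_Year),
                            stringsAsFactors = FALSE)
names(year_count) <- c("Outlet_Establishment_Year", "Outlet_Year_Count")
year_count$Outlet_Establishment_Year <-
  as.numeric(year_count$Outlet_Establishment_Year)

# Unique keys in the lookup: the row count of combi is preserved.
good <- merge(combi, year_count, by = "Outlet_Establishment_Year")

# Duplicated keys in the lookup: every match is repeated, so rows multiply.
bad_lookup <- rbind(year_count, year_count)
bad <- merge(combi, bad_lookup, by = "Outlet_Establishment_Year")

c(original = nrow(combi), good = nrow(good), bad = nrow(bad))
```

If the count table was built or joined more than once, or the grouping step did not collapse the data to one row per year, the join behaves like the `bad` case.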
JAYMIN
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=107918#RESPOND)
MARCH 21, 2016 AT 11:48 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-107918)
Hello sir, I am a fresher electrical engineer, and my maths and logical thinking are good. Can I become a data scientist? Please give me some advice. Thanks.
Roy Basan
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=108023#RESPOND)
MARCH 22, 2016 AT 7:33 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-108023)
I tried to open the link for the Big Mart Prediction but was unable to, as it requires membership. When I applied for Analytics Vidhya membership by signing up, I got an Invalid Request error twice. May I know how I can get over this issue? Why can't I sign up, so I can
Hulisani
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=109002#RESPOND)
APRIL 5, 2016 AT 4:02 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-109002)
Hi
Thanks for an amazing article. Can you please email me the data used.
Analytics Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112100#RESPOND)
JUNE 10, 2016 AT 11:09 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-112100)
Hi Hulisani
Please download the data set from here: http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii
Priyanka Nath says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=109960#RESPOND)
APRIL 24, 2016 AT 7:20 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-109960)
Hi,
I am facing a problem in Random Forest execution.
I am using R Studio (R version 3.2.4 Revised)
When I try to run the rf_model code and then print(rf_model), it returns an error of this form:
Error in { : task 1 failed cannot allocate vector of size 554.2 Mb
In addition: Warning messages:
1: executing %dopar% sequentially: no parallel backend registered
2: In eval(expr, envir, enclos) :
model fit failed for Fold1: mtry=15 Error in { : task 1 failed cannot allocate vector of size 354.7 Mb
3: In eval(expr, envir, enclos) :
model fit failed for Fold2: mtry= 2 Error in { : task 1 failed cannot allocate vector of size 177.3 Mb
Priyanka Nath says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=109961#RESPOND)
APRIL 24, 2016 AT 7:23 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-109961)
Analytics Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112101#RESPOND)
JUNE 10, 2016 AT 11:11 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-112101)
Hi Priyanka
Had I been in your place, I wouldn't have experimented with parallel random forest on this problem.
Why make things complicated when it can be done in a simple way?
Also, make sure that you drop the ID column before running any algorithm. Things should work fine then.
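To illustrate why dropping the ID column matters, here is a minimal base R sketch. It assumes the identifier is stored as a factor column named Item_Identifier; the three rows of data are made up. An identifier with many distinct values expands into one dummy column per level, which is one plausible source of the memory errors reported above.

```r
# Hypothetical miniature of the training data; values are illustrative only.
train <- data.frame(
  Item_Identifier   = factor(c("FDA15", "DRC01", "FDN15")),
  Item_MRP          = c(249.8, 48.3, 141.6),
  Item_Outlet_Sales = c(3735.1, 443.4, 2097.3)
)

# Drop identifier columns before modelling.
id_cols <- c("Item_Identifier")
model_data <- train[, !(names(train) %in% id_cols), drop = FALSE]

# Compare the width of the design matrix with and without the identifier.
with_id    <- model.matrix(Item_Outlet_Sales ~ ., data = train)
without_id <- model.matrix(Item_Outlet_Sales ~ ., data = model_data)

c(with_id = ncol(with_id), without_id = ncol(without_id))
```

With only three identifiers the difference is small, but the number of extra columns grows with the number of distinct IDs.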
Raju says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=110044#RESPOND)
APRIL 26, 2016 AT 9:00 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-110044)
Hi Manish,
After reading the whole article, I feel you have done a great job and have given more than enough data for a beginner.
I'm thankful to you for sharing all your solutions; this gives us different ideas to start with.
Regards,
Raju.
Analytics Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112099#RESPOND)
JUNE 10, 2016 AT 11:08 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-112099)
Gregory says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=111547#RESPOND)
MAY 28, 2016 AT 11:50 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-111547)
Good morning
I cannot find the data set. Any suggestions?
Analytics Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112098#RESPOND)
Hi Gregory
Please download the data from here: http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii
Gregory says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=111548#RESPOND)
MAY 28, 2016 AT 11:52 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-111548)
Toddhim says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112093#RESPOND)
JUNE 10, 2016 AT 9:45 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/#COMMENT-112093)
I know this is months after this great article was published, but I'm just now working through it, and the BigMart Sales Prediction dataset isn't available. Is it available elsewhere?
Analytics Vidhya Content Team says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=112096#RESPOND)
JUNE 10, 2016 AT 11:06 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-112096)
Hi Toddhim,
The data set is available. I've already updated the links.
You can download the data from here: http://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii
vipin says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=113378#RESPOND)
JULY 12, 2016 AT 5:22 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-113378)
Hi Manish,
First of all thanks for a great article.
I encountered an issue when running this code:
combi <- full_join(c, combi, by="Outlet_Establishment_Year")
It gives me this error:
Error: std::bad_alloc
What is it, and how can I correct it?
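std::bad_alloc means a memory allocation failed. One possible cause (an assumption, since the data is not shown) is that both tables contain repeated keys, so the join result is much larger than either input. The sketch below estimates the number of matched rows before joining; `expected_join_rows` is a hypothetical helper written for this illustration, not a dplyr function.

```r
# Hypothetical helper: number of matched rows a join on one key would produce.
# For each key present on both sides, the join yields
# (rows on the left) * (rows on the right) rows.
expected_join_rows <- function(left_keys, right_keys) {
  left_n  <- table(left_keys)
  right_n <- table(right_keys)
  common  <- intersect(names(left_n), names(right_n))
  sum(as.numeric(left_n[common]) * as.numeric(right_n[common]))
}

# One row per key on the right: the result has as many rows as the left side.
small <- expected_join_rows(c(1985, 1985, 1999), c(1985, 1999))

# Repeated keys on both sides: the row count multiplies.
large <- expected_join_rows(c(1985, 1985, 1999), c(1985, 1985, 1999, 1999))

c(small = small, large = large)
```

If the estimate is far larger than the number of rows in combi, the lookup table should be reduced to one row per Outlet_Establishment_Year before joining.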
vipin says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=113383#RESPOND)
JULY 12, 2016 AT 8:36 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-113383)
2.
combi <- dummy.data.frame(combi, names = c('Outlet_Size','Outlet_Location_Type','Outlet_Type',
'Item_Type_New'), sep='_')
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
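The message says that sort.list received a list rather than a plain (atomic) vector. A likely cause, stated here as an assumption, is that combi is no longer a plain data frame after the dplyr join, so selecting a single column returns a one-column table instead of a vector; converting with as.data.frame(combi) before calling dummy.data.frame may help. The base R sketch below only demonstrates the difference between the two kinds of object, using made-up data.

```r
# Hypothetical one-column data frame.
df <- data.frame(Outlet_Size = c("Small", "Medium", "High"),
                 stringsAsFactors = FALSE)

as_vector <- df[, "Outlet_Size"]  # atomic character vector
as_list   <- df["Outlet_Size"]    # one-column data frame, which is a list

vector_is_atomic <- is.atomic(as_vector)
list_is_atomic   <- is.atomic(as_list)

# Sorting works on the atomic vector...
sorted_ok <- sort.list(as_vector)

# ...but fails on the list-like object with the same kind of error.
failed <- inherits(try(sort.list(as_list), silent = TRUE), "try-error")

c(vector_is_atomic = vector_is_atomic,
  list_is_atomic = list_is_atomic,
  failed = failed)
```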
simar says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=114293#RESPOND)
JULY 31, 2016 AT 3:27 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-114293)
Hi Manish,
Can you please let me know what you mean by "Item_Fat_Content has mismatched factor levels"?
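As a hedged illustration of what mismatched factor levels can look like: the same category is spelled in several ways, so R treats the spellings as different levels. The spellings below are assumed for the example, and the base R level replacement shown is one possible way to merge them, not necessarily the tutorial's method.

```r
# Hypothetical values: two real categories written in five different ways.
fat <- factor(c("Low Fat", "LF", "low fat", "Regular", "reg"))
levels_before <- nlevels(fat)

# Renaming several levels to the same label merges them into one level.
levels(fat)[levels(fat) %in% c("LF", "low fat")] <- "Low Fat"
levels(fat)[levels(fat) == "reg"] <- "Regular"
levels_after <- nlevels(fat)

table(fat)
```

After the merge, the variable has only the two intended levels, and every observation is counted under the correct one.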
Parul says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=114413#RESPOND)
AUGUST 3, 2016 AT 5:50 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-114413)
Hi Manish. Thanks for this article. It is very well written and will help everyone. I have one query: I could follow your post very well up to "Graphical Representation of Variables", after which I am unable to figure out how to write these codes, what they mean and signify, and how to know which command to use and when. I am a beginner in R. Can you please suggest what I should do to fully understand all the steps from Graphical Representation onwards? This includes Data Manipulation and Predictive Modeling as well. Thanks a lot.
KarlWang
says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=114457#RESPOND)
AUGUST 4, 2016 AT 8:37 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-114457)
A great article, and thank you so much for sharing your knowledge! I am not sure if others share these questions, but I have listed mine below. I hope you have some time to take a look at them. Thank you again.
1. About the difference between label encoding and one hot encoding: for label encoding, your example converts the 2-level variable Item_Fat_Content into 0 and 1. If I have a variable US state (50 levels = 50 states), does it mean I simply need to translate the states to the numbers 1-50? It is still one variable, just converted from categorical to numerical, am I right?
2. For one hot encoding, I need to split it into 50 variables (50 states) and mark them as 0s and 1s to indicate existence or non-existence, am I right?
3. What are the advantages and disadvantages of converting categorical variables into numeric variables? Why do we need to do this transformation?
4. The article says, "We did one hot encoding and label encoding. That's not necessary since linear regression handles categorical variables by creating dummy variables intrinsically." How do we know which models need one hot encoding / label encoding?
5. You mentioned correlated variables. At what level of correlation do we need to remove the correlated variables? 0.5, 0.6 or 0.7? And if two variables are correlated, how do we decide which one to remove? Is there any standard for it?
6. I am running logistic regression. When I remove one of the correlated variables (0.68), the R² dropped. Does it mean this level of correlation (0.68) is acceptable?
7. A linear regression model with a funnel shape in the Residuals vs Fitted graph indicates heteroscedasticity. So how do we evaluate a logistic regression with the Residuals vs Fitted graph?
8. The article says, "This model can be further improved by detecting outliers and high leverage points." What is the technique to deal with these points? Just simply remove the record or
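The contrast raised in questions 1 and 2 can be sketched in base R. This is only an illustration with a made-up three-level `state` variable standing in for the 50 states: label encoding keeps a single column of integer codes, while one hot encoding produces one 0/1 column per level.

```r
# Hypothetical categorical variable (3 levels instead of 50 for brevity).
state <- factor(c("CA", "NY", "TX", "NY", "CA"))

# Label encoding: still one variable, with levels mapped to integer codes.
# The codes follow the sorted level order, which implies an ordering
# that the categories themselves do not have.
label_encoded <- as.integer(state)

# One hot encoding: one indicator column per level.
one_hot <- model.matrix(~ state - 1)

list(label_encoded = label_encoded, one_hot_columns = ncol(one_hot))
```

With 50 states, the first approach would still give one column with codes 1 to 50, and the second would give 50 indicator columns.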
Monish Mathpal says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=114761#RESPOND)
AUGUST 12, 2016 AT 6:53 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-114761)
Vaibhav Gupta says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=114970#RESPOND)
AUGUST 20, 2016 AT 6:59 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-114970)
Hi
I am a beginner in data science using R. I was going through your well-articulated article on data science using R. I was practicing your Big Mart prediction and got confused by one step, where it checks for missing values during exploration of the train data. As per R and this tutorial, there are missing values (I assume blanks are considered missing data) only in Item_Weight, but data is also missing in Outlet_Size in the train CSV. Yet neither R nor this tutorial shows Outlet_Size as having missing values.
Can you please let me know how and why Outlet_Size is not considered to have missing values in the data exploration of train?
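A probable explanation, sketched below with a made-up three-row CSV: when read.csv reads a blank field in a numeric column it stores NA, but a blank field in a text column is stored as the empty string "", which is.na() does not count as missing. Passing na.strings = c("", "NA") makes blanks in text columns missing as well.

```r
# Hypothetical miniature of the train CSV; the second row is blank in both columns.
csv_text <- "Item_Weight,Outlet_Size\n9.3,Medium\n,\n17.5,Small\n"

# Default behaviour: only the numeric column reports a missing value.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
missing_raw <- colSums(is.na(raw))

# Treat empty strings as missing values while reading.
fixed <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("", "NA"))
missing_fixed <- colSums(is.na(fixed))

rbind(default = missing_raw, with_na_strings = missing_fixed)
```

In the default read, the blank Outlet_Size can still be found with a comparison such as raw$Outlet_Size == "".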
Vaibhav Gupta says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=114972#RESPOND)
AUGUST 20, 2016 AT 8:43 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-114972)
Hi
I would also like to know which mathematical concepts, such as algebra and statistics, are required to learn data science using R. Can anybody list all the mathematical concepts required for data science?
Thanks
Vaibhav Gupta
Iigo says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=115093#RESPOND)
AUGUST 24, 2016 AT 1:28 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-115093)
Hello, I had an error when launching RStudio. I downloaded it again and installed it again, but
Shuu says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=115131#RESPOND)
AUGUST 25, 2016 AT 1:13 PM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-115131)
As someone who came from a non-coding background, you should know that small details can
become HUGE hindrances in the learning process of a beginner.
On the Essentials part of the article, this code doesnt work:
> bar class(bar)
> integer
> as.numeric(bar)
> class(bar)
> numeric
> as.character(bar)
> class(bar)
> character
You have to actually assign the result, as in 'bar <- as.numeric(bar)', on the 4th line.
Please keep those small things in mind. It is insanely difficult for someone like me to learn this content; if things are any less than perfect, it really becomes impossible (I just spent almost an hour trying to figure out why I couldn't change the class of the object and, in the end, had to ask for external help since I couldn't troubleshoot it myself).
Otherwise, great article, keep the great work up!
Cheers.
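The point made in this comment can be shown with a short sketch. The starting value of `bar` is an assumption for illustration; the behaviour shown is standard R: as.numeric() and as.character() return a converted copy and leave the original object unchanged unless the result is assigned back.

```r
bar <- 0:5                          # assumed starting value: an integer vector
class_before <- class(bar)

as.numeric(bar)                     # returns a converted copy; bar is unchanged
class_unassigned <- class(bar)

bar <- as.numeric(bar)              # assign the result back to change bar
class_numeric <- class(bar)

bar <- as.character(bar)            # the same applies to as.character
class_character <- class(bar)

c(class_before, class_unassigned, class_numeric, class_character)
```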
Joyce Salil says:
REPLY (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCE-SCRATCH/?REPLYTOCOM=117457#RESPOND)
OCTOBER 24, 2016 AT 8:52 AM (HTTPS://WWW.ANALYTICSVIDHYA.COM/BLOG/2016/02/COMPLETE-TUTORIAL-LEARN-DATA-SCIENCESCRATCH/#COMMENT-117457)