DATA ANALYTICS WITH R
DATA MANAGEMENT PROJECT BY:
NISHANT CHATURVEDI
REBECCA NAMUBIRU
AGENDA
INTRODUCTION
PROBLEM STATEMENT
DATA WRANGLING
RANDOM FOREST
FEATURE ANALYSIS
SUMMARY
Data is not in proper format
Cleansing and wrangling data
Qualitative analysis
Scientific research
KDNUGGETS SURVEY 2019
Popularity of tools:
Python 66%
RapidMiner 51%
R 47%
Excel 35%
Anaconda 34%
SQL 33%
Reference:
[Bar chart: Number of Scholarly Articles by tool (SPSS, R, SAS, Stata, GraphPad Prism, Matlab); data labels shown: 55000, 38000, 33000, 31000, 30000]
Reference:
ATTRIBUTES
• Customer_id :- integer values to identify customers
• Subscription_flag :- takes the values 0, 1, -1 to signify the status of the customer: 0 for unsubscribed, 1 for subscribed and -1 for unknown
• days_of_membership :- the tenure of the customer's subscription
• no_of_movie :- number of movies watched by the customer in the previous month
• No_of_serie :- number of TV series watched by the customer in the previous month
• No_of_documentary :- number of documentaries watched by the customer in the previous month
• activity_time :- activity time on the application in the last month
• avg_rating :- sum of ratings given / number of shows rated
• no_of_accounts :- number of active accounts in the subscription
• device_type :- the devices being used to access the application
• age :- age of the customer
• gender :- 0 for Female and 1 for Male
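As a small illustration of how the Subscription_flag values could be handled during wrangling, the sketch below separates customers with a known status (0 or 1) from those marked -1. It is written in Python purely for illustration (the project itself uses R), the sample rows are made up, and `split_by_known_status` is a hypothetical helper, not part of the project.

```python
# Hypothetical sample rows; the values are invented for illustration only.
records = [
    {"Customer_id": 1, "Subscription_flag": 1, "days_of_membership": 120},
    {"Customer_id": 2, "Subscription_flag": -1, "days_of_membership": 15},
    {"Customer_id": 3, "Subscription_flag": 0, "days_of_membership": 300},
]


def split_by_known_status(rows):
    """Separate rows with a known subscription status from unknown ones.

    Subscription_flag: 0 = unsubscribed, 1 = subscribed, -1 = unknown.
    """
    labeled = [r for r in rows if r["Subscription_flag"] in (0, 1)]
    unknown = [r for r in rows if r["Subscription_flag"] == -1]
    return labeled, unknown


labeled, unknown = split_by_known_status(records)
print(len(labeled), "rows with known status,", len(unknown), "unknown")
```

Only the rows with a known status can serve as training examples for a classifier; what to do with the unknown rows is a separate decision that the slides do not specify.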
2. What key factors play the most important role in determining the subscription of a customer?
FEATURE ANALYSIS
DATA WRANGLING
• Converting data from one format to another.
• Gathering Data, Selecting Data, Transforming Data
PREDICTIVE ANALYSIS
• Identify future likelihood of an event
• Applying Random Forest to perform Predictive Analysis
FEATURE IMPORTANCE
• To find the relative importance of different parameters
• By giving scores to different parameters
DATA WRANGLING
PREDICTIVE ANALYSIS
FEATURE IMPORTANCE
Inspired by:
https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/
Out-of-Bag (OOB) error
Range: 0 - 1
Estimated on observations that were not included in the sample used to grow a given tree
Often computed rates
Accuracy = (TP+TN)/n
Error Rate = (FP+FN)/n or 1-Accuracy
True Positive Rate = TP/actual yes
True Negative Rate = TN/actual no
Precision = TP/predicted yes
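The rates above can be computed directly from the four cells of a confusion matrix. The sketch below is in Python purely for illustration (the project itself uses R); `confusion_rates` is a hypothetical helper, the counts are made up, and it assumes every denominator is non-zero.

```python
def confusion_rates(tp, fp, fn, tn):
    """Compute common rates from confusion-matrix counts.

    Assumes n, actual yes, actual no and predicted yes are all non-zero.
    """
    n = tp + fp + fn + tn          # total number of observations
    actual_yes = tp + fn           # observations that are truly positive
    actual_no = tn + fp            # observations that are truly negative
    predicted_yes = tp + fp        # observations predicted positive
    return {
        "accuracy": (tp + tn) / n,
        "error_rate": (fp + fn) / n,
        "true_positive_rate": tp / actual_yes,
        "true_negative_rate": tn / actual_no,
        "precision": tp / predicted_yes,
    }


# Made-up counts, for illustration only.
rates = confusion_rates(tp=100, fp=10, fn=5, tn=50)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and error rate always sum to 1, which is a quick sanity check on any reported pair.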
Remove Redundant Features
Rank Features by Importance
Select Features
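The three steps above can be sketched as follows. This is a minimal illustration in Python (the project itself uses R), not the project's actual procedure. It assumes that "redundant" means highly correlated with a feature already kept. The 0.9 threshold, the sample columns and the importance scores are all invented, and every function name is hypothetical. In practice the scores would come from the fitted random forest.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def remove_redundant(columns, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def rank_by_importance(names, importance):
    """Sort feature names from most to least important."""
    return sorted(names, key=lambda name: importance[name], reverse=True)


# Invented sample columns and scores, for illustration only.
columns = {
    "activity_time": [10, 20, 30, 40, 50],
    "no_of_movie": [1, 2, 3, 4, 5],        # perfectly correlated with activity_time
    "age": [30, 22, 45, 28, 39],
    "avg_rating": [3.5, 4.0, 4.5, 2.5, 3.0],
}
importance = {"activity_time": 0.40, "no_of_movie": 0.35, "age": 0.10, "avg_rating": 0.15}

kept = remove_redundant(columns)              # step 1: remove redundant features
ranked = rank_by_importance(kept, importance) # step 2: rank features by importance
selected = ranked[:2]                         # step 3: select the top features
print(selected)
```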
Reliable
Large user base
Many user-contributed packages
Time saving
• https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html
(Survey by KD Nuggets)
• https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
Nishant Chaturvedi
Rebecca Namubiru