Using R in Azure ML
R – Ecosystem Fundamentals
Azure ML + R
Assumptions
• We can’t do any one topic proper justice.
• So this talk will introduce the core ecosystem for your own follow up.
• No mathematical proofs.
Hopefully.
• You will know what you don’t know about Data Science
• Set expectations and realities about Data Science
Python Pros
• Best all-round scripting language. Data science support is improving.
• Better 64 bit support and scalability?
Or use R Studio.
R Package Install/ Reference
To use an installed package, at the command line:
library("ggplot2")
R Studio code hint.
lattice
• Enhanced package. Not very widely adopted.
Concise way to show median, 1st/ 3rd quartiles, 1.5 * IQR and outliers.
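The numbers a box plot shows can also be computed without drawing anything. A minimal sketch using base R's boxplot.stats(); the sample values are made up for illustration.

```r
# Hypothetical sample: nine ordinary values and one extreme one.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# boxplot.stats() computes the numbers a box plot draws, without plotting.
bp <- boxplot.stats(x, coef = 1.5)

# stats: lower whisker, 1st quartile (hinge), median, 3rd quartile (hinge),
# upper whisker.
bp$stats

# out: points further than 1.5 * IQR from the hinges, drawn as outliers.
bp$out

# boxplot(x) would draw the same summary as a plot.
```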
Scatter plot matrix and R pairs function
sqldf
• Surprisingly good SQL syntax fidelity
knitr
R Dynamic Report Packages
• R Markdown + embedded R code => reports. HTML/ PDF/ LaTeX.
• Ideal platform for Reproducible Research.
• Demo. Properly cool.
shiny
• Interactive publishing of R driven web pages. Client and server bits.
slidify
• Generation of slide decks from R Markdown/ YAML/ R.
R – Selected Language Basics
R Fundamental Data Structures
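A quick sketch of the data structures most R code is built from, using base R only. The variable names and values are invented for illustration.

```r
# Atomic vector: every element has the same type.
ages <- c(31, 45, 27)

# Factor: categorical values stored as integer codes plus labels.
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))

# List: elements may have different types and lengths.
person <- list(name = "Ann", scores = c(90, 85))

# Matrix: a vector with two dimensions, filled column by column.
m <- matrix(1:6, nrow = 2)

# Data frame: a list of equal-length columns, one row per observation.
df <- data.frame(age = ages, size = sizes)
str(df)
```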
Outliers
• Extreme values well outside the norm. Eg Australia’s billionaires
• How are they handled? Depends.
Unsupervised Learning
• No past results to train on, thus more difficult to evaluate
• Find patterns, often using clustering
• Eg Google News
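Clustering can be tried in a few lines with base R's kmeans(). This is a sketch on synthetic, unlabelled points; real data is rarely this cleanly separated, and choosing two clusters is an assumption made for the example.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Hypothetical data: two well-separated groups of 2-D points, no labels.
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

# Ask for two clusters; nstart = 10 keeps the best of ten random starts.
fit <- kmeans(pts, centers = 2, nstart = 10)

fit$centers          # estimated cluster centres
table(fit$cluster)   # how many points fell in each cluster
```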
Supervised Learning Experiments
Split available data into training and test samples
• Often training 70% as a rule of thumb
• Fit a model to the training set, aiming for a fit that is close to just right
• Validate model against test set
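The split, fit and validate steps can be sketched with base R. The built-in mtcars data set and the choice of wt and hp as predictors are stand-ins chosen for illustration, not part of the original experiment.

```r
set.seed(1)  # make the random split reproducible

# mtcars ships with R; 70/30 is a rule of thumb, not a law.
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit on the training rows only.
model <- lm(mpg ~ wt + hp, data = train)

# Validate on the held-out rows: root mean squared error of the predictions.
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

Comparing this test error with the error on the training rows is a simple first check for overfitting.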
Beware of.
• Underfitting. Not a convincing predictor.
• Overfitting. The model fits the errors/ outliers in the training data. Great
fit of training data, rubbish for other data sets.
Experiment Types
Your training data has a lot of features. Should we use them all?
• No! Too many dimensions, too much noise.
• Drop collinear features and those with marginal value
• Combine features where it makes sense
• randomForest model to assess importance
• Stepwise elimination of features, R has step() function
• Be ruthless!
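Stepwise elimination with step() might look like the sketch below. It uses the built-in mtcars data as a stand-in and backward elimination by AIC, which is one of several possible strategies.

```r
# Start from a model that uses every other column of mtcars as a feature.
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination: repeatedly drop the term whose removal lowers AIC most.
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)   # the surviving features
c(full = length(coef(full)), reduced = length(coef(reduced)))
```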
Averages and Standard Deviation
How to do an average.
• Mean. Sum of observations / # of observations – outlier sensitive
• Median. Middle value
• Mode. Most common value, best for factors (categorical)
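The three averages in base R, as a small sketch. The data is invented, and stat_mode() is a helper defined here because base R has no built-in statistical mode.

```r
# Hypothetical incomes with one extreme value.
income <- c(40, 45, 50, 55, 60, 1000)

mean(income)     # pulled upwards by the outlier
median(income)   # middle value, barely affected

# mode() in base R reports the storage type, not the most common value,
# so stat_mode() is a small helper defined for this example.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

colours <- factor(c("red", "blue", "red", "green", "red"))
stat_mode(colours)
```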
Spread of data.
• Variance is the sum of (Value – Mean) squared / # observations. Squaring (a)
removes the sign so deviations cannot cancel out (b) gives large deviations more weight.
• Take the square root of the variance to get the Standard Deviation, which brings
the value back to the same scale as the observations, thus commonly used.
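A worked sketch of both quantities on made-up numbers. Note that R's var() and sd() divide by n - 1 (the sample estimate) rather than by the number of observations, so they differ slightly from the population formula.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical observations
n <- length(x)

# Population variance: mean of the squared deviations from the mean.
pop_var <- sum((x - mean(x))^2) / n
pop_sd  <- sqrt(pop_var)

# R's var() and sd() divide by n - 1, so they are slightly larger.
sample_var <- var(x)
sample_sd  <- sd(x)

c(pop_var = pop_var, pop_sd = pop_sd,
  sample_var = sample_var, sample_sd = sample_sd)
```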
Normalize Data/ R scale function
Solution? Scaling.
• Common solution is to normalize data to a scale where mean = 0 and
standard deviation = 1.
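A minimal sketch of this normalization with R's scale() function; the two columns are invented to show features on very different scales.

```r
# Two hypothetical features on very different scales.
df <- data.frame(height_cm = c(150, 160, 170, 180, 190),
                 income    = c(20000, 35000, 50000, 80000, 120000))

# scale() subtracts each column's mean and divides by its standard deviation.
scaled <- scale(df)

round(colMeans(scaled), 10)   # each column now has mean 0
apply(scaled, 2, sd)          # and standard deviation 1
```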
Principles
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
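To make the three principles concrete, here is a sketch that reshapes a small untidy table by hand in base R. The countries and values are invented for illustration.

```r
# Untidy: one column per year, so the variable "year" hides in column names.
wide <- data.frame(country = c("AU", "NZ"),
                   y2014 = c(10, 4),
                   y2015 = c(12, 5))

# Tidy: one column per variable, one row per (country, year) observation.
tidy <- data.frame(
  country = rep(wide$country, times = 2),
  year    = rep(c(2014, 2015), each = nrow(wide)),
  value   = c(wide$y2014, wide$y2015)
)
tidy
```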
Azure ML – Quick Overview
Azure ML – Get Started
I/O
• 3 input ports, internally use “tables” t1, t2 and t3
• 1 output port with results
Execute R Script
• Dataset1, Dataset2; Azure table -> R data frame
• Script bundle; Zip -> code, objects, packages
3 input ports
2 output ports
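A sketch of what an Execute R Script body can look like. To my understanding the module's template reads inputs with maml.mapInputPort() and returns a data frame with maml.mapOutputPort(); the local stand-ins below are hypothetical and exist only so the sketch also runs outside Azure ML.

```r
# Local stand-ins (assumption): inside Azure ML the module provides these.
if (!exists("maml.mapInputPort")) {
  maml.mapInputPort  <- function(port) head(mtcars, 10)
  maml.mapOutputPort <- function(name) invisible(NULL)
}

# Input port 1 arrives as an R data frame.
dataset1 <- maml.mapInputPort(1)

# Any ordinary R transformation: add fuel economy in km per litre.
dataset1$kpl <- dataset1$mpg * 0.4251

# Name the data frame to send to the result output port.
data.set <- dataset1
maml.mapOutputPort("data.set")
```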
Fairly mechanical.
• Create your own source function(s) in a .R file
• Zip up that file, with the name you want displayed in ML
• In ML, call Add Dataset to import file.
• Visible in My Datasets in ML.
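The steps above can be sketched locally as follows. The file name my_utils.R and the function mpg_to_kpl() are hypothetical; as far as I know, inside Execute R Script the zipped bundle is unpacked under src/, so there the call would be source("src/my_utils.R").

```r
# Step 1: put your helper functions in a .R file (a temp dir is used here).
lib_file <- file.path(tempdir(), "my_utils.R")   # hypothetical file name
writeLines(c(
  "# Convert miles per US gallon to kilometres per litre.",
  "mpg_to_kpl <- function(mpg) mpg * 0.4251"
), lib_file)

# Step 2 (not run here): zip the file and upload it via Add Dataset.

# Step 3: load the functions before using them.
source(lib_file)

mpg_to_kpl(c(21, 30))
```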
Own R Library Example
Create R Model Module
A module which includes model and scoring scripts
• Own R environment
• Only pre loaded R packages
• Only one output, no graphics
I/O
• Input. Training data frame
• Output. Model object.
Scripts
• Trainer script
• Scorer: uses R predict function
Sample R Model Module Code
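A minimal sketch of a trainer and scorer pair, not the original sample. It assumes the module hands each script its data as `dataset`, expects the trainer to leave the fitted object in `model`, and expects the scorer to leave a data frame in `scores`. The mtcars rows are stand-ins so the sketch runs outside Azure ML.

```r
# Stand-in for the training data frame that Azure ML passes in as `dataset`.
dataset <- mtcars[1:24, ]

# --- Trainer script: leaves the fitted object in `model` ---
model <- lm(mpg ~ wt + hp, data = dataset)

# Stand-in for the data frame passed to the scorer, again as `dataset`.
dataset <- mtcars[25:32, ]

# --- Scorer script: leaves a data frame of predictions in `scores` ---
scores <- data.frame(predicted_mpg = predict(model, newdata = dataset))
head(scores)
```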
Energy Efficiency Visualisation
Use the Project Columns module to drop a few columns.
• Use the Linear Regression module, with solution method Ordinary Least Squares.
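For comparison, Ordinary Least Squares in plain R. This sketch uses the built-in mtcars data as a stand-in for the energy efficiency data and checks lm() against the normal equations.

```r
# Fit mpg as a linear function of weight and horsepower by least squares.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The same coefficients from the normal equations: solve (X'X) b = X'y.
X <- model.matrix(~ wt + hp, data = mtcars)
y <- mtcars$mpg
beta <- solve(crossprod(X), crossprod(X, y))

cbind(lm = coef(fit), normal_equations = drop(beta))
```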