Using R in Azure ML

This document provides an introduction to using R in Azure Machine Learning. It begins with an overview of the R ecosystem and key packages. It then covers data science principles like feature selection and normalization. The document provides a brief overview of Azure ML and how R can be used within it. It emphasizes that Azure ML allows R models and code to be reused via web services.

Using R in Azure Machine Learning

Take Azure ML to the next level with R

Alnis Bajars. [email protected] @alnisb


Agenda
Using R in Azure Machine Learning

R – Ecosystem Fundamentals

R – Selected Language Elements

Data Science Principles (including some lingo)

Azure ML Quick Overview

Azure ML + R
Assumptions
• We can’t do any one topic proper justice.
• So this talk will introduce the core ecosystem for your own follow up.
• No mathematical proofs.

Hopefully:
• You will know what you don’t know about Data Science
• You will have realistic expectations about Data Science

These slides are meant to be used!


References
Coursera – R Programming (Johns Hopkins)
• Quality of explanations variable.
• But the practical assignments are good, and deadline driven.

Safari Books Online


• Video. Introduction to Data Science with R. Garrett Grolemund (R Studio).
• Most published books on R

Hadley Wickham (@hadleywickham) – Modern Godfather of R


• Chief Data Scientist at R Studio
• Author of influential R packages

Azure Machine Learning


• Intro video at Microsoft Virtual Academy
References continued
edX – Azure ML with R/ Python
• A number of demos driven by this content.
R – Ecosystem Fundamentals
R vs Python
R Pros
• More mature data science support (20 years +), purpose built
• More established ML support
• T-SQL integration in SQL Server 2016 (out of scope)

Python Pros
• Best all round script language. Data science support improving.
• Better 64 bit support and scalability?

R performance and scalability – don’t forget Revolution Analytics


R Ecosystem – Essential Bits
Get R. https://cran.r-project.org/ (Revolution Analytics R not covered)
Get R Studio. https://www.rstudio.com/
• Essential IDE. But much more: packages, RPubs etc.
• Download R, then R Studio.

Get a Github account. https://github.com/ and Github shell.


• Distributed source code control system.
• Essential part of R social network.

RStudio a better environment for test and debug than Azure ML!
Github Lifecycle Cheat Sheet

From github.com
• Create repository (or fork someone else’s).

From local Github shell.


git clone <URL_of_repository>
cd <repository>
git add <files>
git commit -a -m "some_message"
git push
R Package Management

You’ll be doing this a lot!


To install a package at the command line.
install.packages("ggplot2") (multiple dependency options)

Or use R Studio.
R Package Install/ Reference
To use an installed package.
At the command line.
library("ggplot2")
R Studio code hint.

Can install libraries from Github (user/repository)


library("devtools")
install_github("ramnathv/rCharts")

Older versions of install_github have user and repository as separate arguments


R Visualisation Packages
plot
• Standard package. Easy to use but presentation ordinary.

lattice
• Enhanced package. Not very widely adopted.

ggplot2 (by Hadley Wickham) – Grammar of Graphics


• Best quality presentations yet easy to use
• Layers approach: ggplot
• Quickie version: qplot
qplot simple example
ggplot2 example inc Linear Model
ggplot2 … if you really want to get funky…
ggplot2 and the Boxplot

Concise way to show median, 1st/ 3rd quartiles, 1.5 * IQR and outliers.
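
The numbers behind a boxplot can be inspected without drawing one. A minimal sketch with made-up values, using base R's boxplot.stats rather than ggplot2; note it reports hinges, which are close to but not always identical to the 1st/ 3rd quartiles.

```r
# Illustrative data: ten ordinary values plus one extreme value.
x <- c(1:10, 50)

# boxplot.stats() returns the five numbers a boxplot draws
# (lower whisker, lower hinge, median, upper hinge, upper whisker)
# plus any points further than coef * IQR beyond the hinges.
bs <- boxplot.stats(x, coef = 1.5)

five_numbers <- bs$stats
outliers     <- bs$out

print(five_numbers)
print(outliers)
```

Here the whiskers stop at the last values inside the 1.5 * IQR fence, and the extreme value is reported separately as an outlier.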
Scatter plot matrix and R pairs function

Concise way to show relationships between all features.


R Data Wrangling Packages
dplyr
• Extensive function set for select/ sort/ filter/ derived columns/ group by/ top n.
• Note the %>% operator to chain dplyr functions – pipeline like

tidyr (Hadley Wickham)


• Statisticians call cleansed data tidy data.
• Normalise/ denormalise.

sqldf
• Surprisingly good SQL syntax fidelity
R Dynamic Report Packages
knitr
• R Markdown + embedded R code => reports. HTML/ PDF/ Latex.
• Ideal platform for Reproducible Research.
• Demo. Properly cool.

shiny
• Interactive publishing of R driven web pages. Client and server bits.

slidify
• Generation of slide decks from R Markdown/ YAML/ R.
R – Selected Language Basics
R Fundamental Data Structures

Script language (Perl/ Python/ Ruby) data structures.


• Scalar
• Array
• Hash (key/value)

Contrast with R data structures


• Vector (a “scalar” is really a 1 element vector)
• Matrix (caveat – data of same type)

R is case sensitive everywhere! (Variables, functions etc.)


The data frame is an operational tabular structure, integral to data manipulation.
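
A short sketch of these structures in base R, with made-up values. The list is included as an assumption about the nearest R counterpart to a script language hash; it is not named on the slide.

```r
# Vector: every element has the same type; a "scalar" is a length-1 vector.
answer <- 42
sizes  <- c(50, 75, 120)          # c() combines values into a vector
mixed  <- c(1, "a", TRUE)         # mixed inputs are coerced to character

# Matrix: a vector with dimensions, so still a single type.
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: named elements that may differ in type (closest to a hash).
person <- list(name = "Ann", age = 31)

# Data frame: a table whose columns are equal-length vectors.
flats <- data.frame(size = sizes, rooms = c(1L, 2L, 3L))

print(length(answer))
print(mixed)
print(dim(m))
print(person$name)
str(flats)
```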
R Data Types
Atomic data types.
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
typeof function handy
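
A quick sketch of typeof against one value of each atomic type. Note that typeof reports numeric values as "double", and a plain number is not an integer unless written with the L suffix.

```r
# One example value per atomic type listed above.
examples <- list("abc", 3.14, 3, 3L, 2+1i, TRUE)

# typeof() returns the internal storage type as a string.
types <- vapply(examples, typeof, character(1))

print(types)
```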
R Assignment and c function

Two assignment operators (<- and =), generally equivalent.


• The <- form most popular.

c for combine to build free form vectors.


Reading Data and Missing Values

A number of functions to read data files (usually read.table).


• Generally into data frames.

How are values not entered handled?


• R default is NA
• This can be overridden (eg the na.strings argument of read.table)
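
A minimal sketch of reading data with missing values. The CSV text and the -999 "missing" marker are made up for illustration; read.csv is a convenience wrapper around read.table.

```r
# Illustrative CSV held in a string; a real script would pass a file name.
csv_text <- "id,price
1,100
2,-999
3,
"

# na.strings lists extra markers to treat as missing. Blank numeric
# fields are already read as NA.
prices <- read.csv(text = csv_text, na.strings = c("NA", "-999"))

print(prices)
print(sum(is.na(prices$price)))
print(mean(prices$price, na.rm = TRUE))   # skip NAs when averaging
```

Most summary functions return NA if any input is NA, hence the na.rm = TRUE argument.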
Looking at the data

A number of handy functions. (Factor – discrete values)


R as a Functional Programming Language

In R, functions are 1st class objects. This is widely used.


Eg apply family of functions. apply, sapply, lapply
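
A small sketch of the apply family on made-up data, showing functions being passed around as ordinary values.

```r
m <- matrix(1:6, nrow = 2)        # rows: (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)      # apply a function over rows (margin 1)
col_max  <- apply(m, 2, max)      # ... or over columns (margin 2)

words        <- c("tidy", "data", "wrangling")
word_lengths <- sapply(words, nchar)    # simplifies to a named vector
upper_words  <- lapply(words, toupper)  # always returns a list

# Functions are values, so they can be stored and handed to other functions.
square  <- function(x) x^2
squares <- sapply(1:4, square)

print(row_sums)
print(col_max)
print(word_lengths)
print(squares)
```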
View command – R Studio Console

Needs no further introduction!


Data Science Principles
Some General Notes
Algorithms vs Data
• Lots of data tends to be more influential than choice of algorithm
• Data collection methodology is critical

Correlation implies Causation?


• No!

Outliers
• Extreme values well outside the norm. Eg Australia’s billionaires
• How are they handled? Depends.

Variable Types (affects Algorithm choice)


• Continuous, eg apartment price
• Discrete, eg species of Iris. Don’t forget the R argument stringsAsFactors (data.frame, read.table)
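
A minimal sketch of what stringsAsFactors changes, using a made-up species column. Since R 4.0.0 the default is FALSE, so it is set explicitly here.

```r
species <- c("setosa", "virginica", "setosa")

# Strings kept as plain text.
as_text <- data.frame(species = species, stringsAsFactors = FALSE)

# Strings converted to a factor, ie a discrete variable with known levels.
as_factors <- data.frame(species = species, stringsAsFactors = TRUE)

print(class(as_text$species))
print(class(as_factors$species))
print(levels(as_factors$species))
```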
Data Analysis Flowchart
Codebook and Interpretation
Codebook is what Statisticians call the document that is
• Field spec of the data
• Details about the data collection

Reference to data set


• US NOAA storm database
http://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype
Read and interpret the Codebook carefully
• Eg Time based issues, all weather events only recorded since 1/1/96
• Careful combining features, eg # fatalities + # injuries does not make sense
Machine Learning – Predictive Types
Supervised Learning
• Train model based on past results, validate with test data
• Independent variables or features as predictors
• Label or dependent variable to predict.
• Eg predict house price based on size, # rooms etc

Unsupervised Learning
• No past results to train on, thus more difficult to evaluate
• Find patterns, often using clustering
• Eg Google News
Supervised Learning Experiments
Split available data into training and test samples
• Often training 70% as a rule of thumb
• Fit a model against the training set, aiming for close to just right accuracy
• Validate model against test set

Beware of.
• Underfitting. Not a convincing predictor.
• Overfitting. Too much fitting of errors/ outliers. Great fit of training data, rubbish for other data sets.
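
The split, fit and validate steps above can be sketched in base R. This uses the built-in mtcars data as a stand-in for the house price example; the choice of features (wt, hp) and the RMSE metric are assumptions for illustration.

```r
set.seed(42)                              # make the random split repeatable

n         <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))   # 70% rule of thumb
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# Fit on the training rows only.
fit <- lm(mpg ~ wt + hp, data = train)

# Root mean squared error: typical size of a prediction miss.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
train_rmse <- rmse(train$mpg, predict(fit, newdata = train))
test_rmse  <- rmse(test$mpg,  predict(fit, newdata = test))

print(c(train = train_rmse, test = test_rmse))
```

A test error far above the training error is one hint of overfitting.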
Experiment Types

At a very high level.


• Regression. Fit a mathematical (often linear) model to predict continuous values.
• Classification. Predict discrete values.
• Clustering. Group data items based on similarity.
• Recommender.
• Anomaly Detection. Detect exception cases.
Feature Selection

Your training data has a lot of features. Should we use them all?
• No! Too many dimensions, too much noise.
• Punt collinear features, those with marginal value
• Combine features where it makes sense
• randomForest model to assess importance
• Stepwise elimination of features, R has step() function
• Be ruthless!
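
A minimal sketch of stepwise elimination with step(), again on the built-in mtcars data as a stand-in. The correlation check is one simple, assumed way to spot collinear features; randomForest is not shown because it is an add-on package.

```r
full <- lm(mpg ~ ., data = mtcars)        # start with every feature

# Backward stepwise elimination by AIC; trace = 0 silences the log.
reduced <- step(full, direction = "backward", trace = 0)

kept    <- attr(terms(reduced), "term.labels")
dropped <- setdiff(attr(terms(full), "term.labels"), kept)

# Collinearity check: strongly correlated predictors carry similar signal.
cor_disp_cyl <- cor(mtcars$disp, mtcars$cyl)

print(kept)
print(dropped)
print(cor_disp_cyl)
```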
Averages and Standard Deviation
How to do an average.
• Mean. Sum of observations / # of observations – outlier sensitive
• Median. Middle value
• Mode. Most common value, best for factors (categorical)

Spread of data.
• Variance is the sum of (Value – Mean) squared / # observations. Square to (a) remove the sign (b) give more weight to values far from the mean. Note the R var function divides by # observations – 1 (sample variance).
• Take square root of variance to get Standard Deviation, which brings the value back to the same scale as the observations, thus commonly used.
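
A worked sketch of these measures on eight made-up observations. The helper for the mode is an assumption, since base R has no function for the most common value.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

avg <- mean(x)
mid <- median(x)

# mode() in R reports the storage mode, not the most common value,
# so count the values and take the most frequent one.
counts      <- table(x)
most_common <- as.numeric(names(counts)[which.max(counts)])

# Population variance divides by n; var() and sd() divide by n - 1.
pop_var   <- mean((x - avg)^2)
pop_sd    <- sqrt(pop_var)
sample_sd <- sd(x)

print(c(mean = avg, median = mid, mode = most_common))
print(c(pop_var = pop_var, pop_sd = pop_sd, sample_sd = sample_sd))
```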
Normalize Data/ R scale function

Features you want to compare naturally have different scales.


• Eg
• The bigger numbers will swamp small numbers in importance.

Solution? Scaling.
• Common solution is to normalize data to a scale where mean = 0 and
standard deviation = 1.

Note Azure ML has a Normalize Data module. R has a scale function.
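
A minimal sketch of the R scale function on made-up apartment data, where floor area would otherwise swamp the room count.

```r
# Made-up apartments: floor area in square metres and number of rooms.
flats <- data.frame(area  = c(45, 60, 85, 120, 200),
                    rooms = c(1, 2, 3, 4, 6))

# scale() centres each column on its mean and divides by its
# standard deviation, returning a matrix.
scaled <- scale(flats)

print(round(colMeans(scaled), 10))   # each column now has mean 0
print(apply(scaled, 2, sd))          # and standard deviation 1
```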


Hypothesis Testing and Confidence Intervals
The protocol for hypothesis testing.
• Hypothesis 0 is the status quo.
• Hypothesis 1 is the alternative (eg new drug).
• Aim is to reject H0 in favour of H1 (or not)

The result is generally framed within a confidence level and reported with a p value.


• Commonly use 95% (ie p < 0.05), a throwback to pre computer days.
• Controversy. The Earth is Round (p < 0.05)
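
A sketch of the protocol using a two sample t test from base R. The data is simulated, not from a real trial, and the group sizes and effect size are arbitrary assumptions.

```r
set.seed(1)                                   # repeatable simulation

# Simulated outcomes, not real trial data.
control   <- rnorm(30, mean = 0, sd = 1)      # H0: status quo
treatment <- rnorm(30, mean = 1, sd = 1)      # H1: the new drug shifts the mean

result <- t.test(treatment, control, conf.level = 0.95)

p_value   <- result$p.value
conf_int  <- result$conf.int                  # 95% interval for the difference
reject_h0 <- p_value < 0.05

print(result)
```

Rejecting H0 at p < 0.05 says the data would be surprising if H0 were true; it does not prove H1.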
Tidy Data

Described by Hadley Wickham in


• Paper - http://vita.had.co.nz/papers/tidy-data.pdf
• Video - https://vimeo.com/33727555

Principles
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
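
A small sketch of the first two principles with made-up test scores. It uses only base R to stay self-contained; the tidyr package provides dedicated functions for this reshaping.

```r
# Untidy: the variable "test" is hidden in the column headers.
wide <- data.frame(name  = c("Ann", "Bob"),
                   test1 = c(5, 7),
                   test2 = c(6, 9))

# Tidy: one column per variable (name, test, score),
# one row per observation (one person sitting one test).
long <- data.frame(
  name  = rep(wide$name, times = 2),
  test  = rep(c("test1", "test2"), each = nrow(wide)),
  score = c(wide$test1, wide$test2)
)

print(long)
```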
Azure ML – Quick Overview
Azure ML – Get Started

What you need. Site is https://studio.azureml.net


• Azure account – does not have to be a trial, Machine Learning has a
free tier.
• Storage account.

* Trap. Storage account must be in same location as ML. Australia might not be available.
Azure ML – Flowchart
Azure ML Example – Compare Two Models
Azure ML Example – Prefab Data Wrangling
Azure ML – Re-use and Monetisation

Re-use via web services.


• REST APIs
• Code snippets in C#, R and Python.

Publish said web services to Azure Marketplace.


• Fairly involved diligence process including approvals.

Sadly, both topics out of scope.


Apply SQL Transformation Module

Use SQL syntax for data wrangling, based on SQLite.

I/O
• 3 input ports, internally use “tables” t1, t2 and t3
• 1 output port with results

Within Azure ML, an easier alternative to the R package sqldf.


Execute R Script and I/O

Extend ML with R
• Its own environment (avoid namespace collisions)
• Need to load packages
• Install new packages via zip

3 input ports
• Dataset1/ Dataset2; Azure table -> R data frame
• Script bundle; Zip -> code, objects, packages

2 output ports
• Results; R data frame -> Azure table
• R Device; stdout, stderr, graphics
Template code for Execute R Script
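
The template appeared as a screenshot. Below is a sketch of the kind of skeleton an Execute R Script module starts from, assuming the maml.mapInputPort and maml.mapOutputPort port mapping functions that Azure ML Studio supplies inside the module. The stand-in definitions at the top are hypothetical and only exist so the script can also be tried in RStudio.

```r
# Local stand-ins (hypothetical) so the script also runs outside Azure ML,
# where the real maml.* functions are supplied by the module.
if (!exists("maml.mapInputPort")) {
  maml.mapInputPort  <- function(port) head(iris, 5 * port)
  maml.mapOutputPort <- function(name) invisible(get(name, envir = parent.frame()))
}

# Map the optional input ports to data frames.
dataset1 <- maml.mapInputPort(1)
dataset2 <- maml.mapInputPort(2)

# Files from the optional zip port are unpacked under ./src/
# source("src/yourfile.R")

# Sample operation: stack the two inputs.
data.set <- rbind(dataset1, dataset2)

# Anything printed or plotted shows up on the R Device port.
print(summary(data.set))

# Name the data frame to send to the results port.
maml.mapOutputPort("data.set")
```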
Execute R Script – a “real” example
Debugging R Code
What if code runs ok in RStudio but not in ML?

There is no debugger as such in ML, so


• Induce an error in R code, eg refer uninitialised object
• Right click R script module, select View Error Log
• Right click R script module, select View Output Log

Latter has more detail


Sample Output Log
Create Your Own R Library

Fairly mechanical.
• Create your own source function(s) in a .R file
• Zip up that file, with the name you want displayed in ML
• In ML, call Add Dataset to import file.
• Visible in My Datasets in ML.
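
A local sketch of the first step, assuming a hypothetical helper to_camel_case and file name mylib.R. In Azure ML the zipped file would be sourced from the src/ folder of the script bundle; here it is written to a temporary folder so the round trip can be tried in RStudio.

```r
# Write the library source to a .R file (this is the file you would zip).
lib_file <- file.path(tempdir(), "mylib.R")
writeLines(c(
  "# Hypothetical helper: turn 'Overall Height' into 'OverallHeight'.",
  "to_camel_case <- function(x) {",
  "  words <- strsplit(x, '[ ._]+')",
  "  vapply(words, function(w) {",
  "    paste0(toupper(substring(w, 1, 1)), substring(w, 2), collapse = '')",
  "  }, character(1))",
  "}"
), lib_file)

# In Execute R Script this would be source("src/mylib.R").
source(lib_file, local = TRUE)

names_in  <- c("Overall Height", "Relative Compactness")
names_out <- to_camel_case(names_in)
print(names_out)
```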
Own R Library Example
Create R Model Module
A module which includes model and scoring scripts
• Own R environment
• Only pre loaded R packages
• Only one output, no graphics

I/O
• Input. Training data frame
• Output. Model object.

Scripts
• Trainer script
• Scorer: uses R predict function
Sample R Model Module Code

Note most set and get functions local to R Model Module.


Sample training script.

Sample scoring script.
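
The sample scripts appeared as screenshots. Below is a local sketch of the trainer/ scorer split, assuming the module hands each script a data frame called dataset, expects the trainer to leave an object called model and the scorer to leave a data frame called scores. Those names are an assumption to verify against the module documentation, and mtcars is only a stand-in for real training data.

```r
# --- Trainer script: receives `dataset`, leaves a `model` object ---
dataset <- mtcars[1:24, ]                 # stand-in for the training input
model   <- lm(mpg ~ wt + hp, data = dataset)

# --- Scorer script: receives `model` and a new `dataset`, leaves `scores` ---
dataset <- mtcars[25:32, ]                # stand-in for the data to score
scores  <- data.frame(Predicted = predict(model, newdata = dataset))

print(scores)
```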


Loading R Packages into Azure ML

There are “only” 350 R Packages in Azure ML – you’ll eventually want to use other packages.

To load an R Package into Azure ML.


• Find the package and download as zip locally
• In ML Studio, select the big “+ NEW” option bottom LHS
• Select DATASET -> FROM LOCAL FILE
• Follow the bouncing ball
Using Loaded R Packages in Azure ML

Effectively need to install the package on each use in Execute R Script.
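
A hedged sketch of that install-then-load step. The helper load_bundled_package and the package name somepackage are hypothetical placeholders, and it assumes the uploaded zip is connected to the script bundle port so that it lands under src/.

```r
# Hypothetical helper: install a package zip that arrived on the script
# bundle port, then attach it. Returns FALSE if the zip is not there.
load_bundled_package <- function(pkg, zip_path, lib = ".") {
  if (!file.exists(zip_path)) {
    message("No package zip at ", zip_path)
    return(FALSE)
  }
  # repos = NULL tells install.packages() to install from the local file.
  install.packages(zip_path, lib = lib, repos = NULL)
  library(pkg, lib.loc = lib, character.only = TRUE)
  TRUE
}

# Placeholder names; run locally, the zip is absent and this returns FALSE.
ok <- load_bundled_package("somepackage", "src/somepackage.zip")
print(ok)
```

Because the module environment is rebuilt on each run, the install has to be repeated every time the script runs.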


Demos – CA Dairy Data

Really simple example of R, plus custom library in action.


Energy Efficiency Visualisation
Steps we take.
• Make Overall Height and Orientation categorical (what R calls Factors).
• Make all column headers CamelCase (remove spaces) to play nicer with R.
• Add R code to use dplyr to create derived columns for squares and cubes.
• Normalize Data for all numeric columns, transformation method MinMax. Note MinMax rescales to the range 0 to 1; ZScore is the method that gives mean 0 and standard deviation 1.
• Add R code to visualise data.
Energy Efficiency Visualisation continued
Now let’s do some data science!

• Project Columns module to punt a few columns.
• Use the Linear Regression module, solution method Ordinary Least Squares.
• Split Data module – 60% training, 40% test
• Train Model module – Linear Regression plus Training data
• Permutation Feature Importance to score model against Test data
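
The idea behind the last three steps can be sketched in plain R. This is not the Azure ML module's implementation; it is a minimal version of permutation importance, with mtcars and the features wt, hp and qsec as stand-ins for the energy efficiency data.

```r
set.seed(7)

n         <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.6 * n))   # 60% training, 40% test
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

features <- c("wt", "hp", "qsec")
fit      <- lm(mpg ~ wt + hp + qsec, data = train)   # ordinary least squares

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
baseline <- rmse(test$mpg, predict(fit, newdata = test))

# Shuffle one feature at a time in the test set; the bigger the rise in
# error, the more the model relied on that feature.
importance <- sapply(features, function(f) {
  shuffled      <- test
  shuffled[[f]] <- sample(shuffled[[f]])
  rmse(test$mpg, predict(fit, newdata = shuffled)) - baseline
})

print(sort(importance, decreasing = TRUE))
```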
Energy Efficiency Visualisation – the score

The relative feature importance.


Summary

Please take this presentation as a call to action.

Alnis Bajars. Email: [email protected] Twitter: @alnisb
