
Data Science in Spark with sparklyr
Cheat Sheet

Intro

sparklyr is an R interface for Apache Spark. It provides a complete dplyr backend and the option to query directly using Spark SQL statements. With sparklyr, you can orchestrate distributed machine learning using either Spark's MLlib or H2O Sparkling Water.

Starting with version 1.044, RStudio Desktop, Server and Pro include integrated support for the sparklyr package. You can create and manage connections to Spark clusters and local Spark instances from inside the IDE: open the connection log, disconnect, open the Spark UI, and preview the first 1K rows of Spark and Hive tables.

Data Science Toolchain with Spark + sparklyr

Import: export an R DataFrame to Spark, read a file, or read an existing Hive table.
Tidy and Transform (Wrangle): dplyr verbs, direct Spark SQL (DBI), SDF functions (Scala API), and Transformer functions.
Model: Spark MLlib or the H2O extension.
Visualize and Communicate: collect data into R for plotting; share plots, documents, and apps.
(Toolchain based on R for Data Science, Grolemund & Wickham.)
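A minimal sketch of the two query paths described above. It assumes an open connection named sc and an iris table already copied to Spark as "spark_iris"; both steps are shown later in this sheet.

library(dplyr)

# dplyr backend: verbs are translated to Spark SQL and run inside Spark
tbl(sc, "spark_iris") %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length))

# Direct Spark SQL via DBI
DBI::dbGetQuery(sc, "SELECT Species, COUNT(*) AS n FROM spark_iris GROUP BY Species")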

Getting started

Local Mode (easy setup; no cluster required)
1. Install a local version of Spark: spark_install("2.0.1")
2. Open a connection: sc <- spark_connect(master = "local")

On a YARN Managed Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes, preferably an edge node.
2. Locate the path to the cluster's Spark Home directory; it is normally /usr/lib/spark.
3. Open a connection:
   spark_connect(master = "yarn-client", version = "1.6.2", spark_home = [Cluster's Spark path])

On a Mesos Managed Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes.
2. Locate the path to the cluster's Spark directory.
3. Open a connection:
   spark_connect(master = [mesos URL], version = "1.6.2", spark_home = [Cluster's Spark path])

On a Spark Standalone Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes or a server in the same LAN.
2. Install a local version of Spark: spark_install(version = "2.0.1")
3. Open a connection:
   spark_connect(master = "spark://host:port", version = "2.0.1", spark_home = spark_home_dir())

Using Livy (Experimental)
1. The Livy REST application should be running on the cluster.
2. Connect to the cluster:
   sc <- spark_connect(master = "http://host:port", method = "livy")
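A few optional helpers when getting set up; this is a sketch, not part of the original sheet, and assumes the connection object is named sc as above.

spark_available_versions()   # Spark versions that spark_install() can download
spark_installed_versions()   # Versions already installed locally
spark_version(sc)            # Version used by an open connection
src_tbls(sc)                 # Tables registered in the Spark session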
Using sparklyr

A brief example of a data analysis using Apache Spark, R and sparklyr in local mode:

library(sparklyr); library(dplyr); library(ggplot2); library(tidyr)
set.seed(100)

spark_install("2.0.1")                            # Install Spark locally
sc <- spark_connect(master = "local")             # Connect to the local version

import_iris <- copy_to(sc, iris, "spark_iris",    # Copy data into Spark memory
                       overwrite = TRUE)

partition_iris <- sdf_partition(                  # Partition the data
  import_iris, training = 0.5, testing = 0.5)

sdf_register(partition_iris,                      # Create Hive metadata for each partition
  c("spark_iris_training", "spark_iris_test"))

tidy_iris <- tbl(sc, "spark_iris_training") %>%   # Wrangle
  select(Species, Petal_Length, Petal_Width)

model_iris <- tidy_iris %>%                       # Spark ML decision tree model
  ml_decision_tree(response = "Species",
                   features = c("Petal_Length", "Petal_Width"))

test_iris <- tbl(sc, "spark_iris_test")           # Create a reference to the Spark table

pred_iris <- sdf_predict(                         # Bring data back into R memory
  model_iris, test_iris) %>%                      # for plotting
  collect()

pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
                        lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()

spark_disconnect(sc)                              # Disconnect

Cluster Deployment

Cluster Deployment Options:
Managed Cluster: Driver Node -> Cluster Manager (YARN or Mesos) -> Worker Nodes
Stand Alone Cluster: Driver Node -> Worker Nodes

Tuning Spark

Example Configuration:

config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client",
                    config = config, version = "2.0.1")

Important Tuning Parameters (with defaults):
spark.yarn.am.cores
spark.yarn.am.memory                512m
sparklyr.shell.executor-memory
sparklyr.shell.driver-memory
spark.executor.heartbeatInterval    10s
spark.network.timeout               120s
spark.executor.memory               1g
spark.executor.cores                1
spark.executor.extraJavaOptions
spark.executor.instances
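A hedged sketch of setting the parameters listed above through spark_config(); the values are illustrative, not recommendations.

config <- spark_config()
config$spark.executor.instances <- 4                # number of executors
config$spark.executor.heartbeatInterval <- "30s"
config$spark.network.timeout <- "300s"
config$`sparklyr.shell.driver-memory` <- "2G"        # backticks needed for hyphenated names
sc <- spark_connect(master = "yarn-client", config = config, version = "2.0.1")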
Import

Copy a DataFrame into Spark:
sdf_copy_to(sc, iris, "spark_iris")
DBI::dbWriteTable(sc, "spark_iris", iris)

sdf_copy_to(sc, x, name, memory, repartition, overwrite)
DBI::dbWriteTable(conn, name, value)

Import into Spark from a file. Arguments that apply to all functions:
sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE

CSV       spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON      spark_read_json(mode = NULL)
PARQUET   spark_read_parquet(mode = NULL)

From a table in Hive:
my_var <- tbl_cache(sc, name = "hive_iris")
  tbl_cache(sc, name, force = TRUE) loads the table into memory.
my_var <- dplyr::tbl(sc, name = "hive_iris")
  dplyr::tbl(sc, ...) creates a reference to the table without loading it into memory.

Visualize & Communicate

Download data to R memory:
r_table <- collect(my_table)
plot(Petal_Width ~ Petal_Length, data = r_table)

dplyr::collect(x)
  Downloads a Spark DataFrame to an R DataFrame.
sdf_read_column(x, column)
  Returns the contents of a single column to R.

Save from Spark to the file system. Arguments that apply to all functions: x, path

CSV       spark_write_csv(header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON      spark_write_json(mode = NULL)
PARQUET   spark_write_parquet(mode = NULL)
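A sketch of pairing the readers and writers above; the file paths and table name are placeholders, not from the original sheet.

flights_tbl <- spark_read_csv(sc, name = "flights", path = "data/flights.csv",
                              header = TRUE, infer_schema = TRUE)   # read CSV into Spark
spark_write_parquet(flights_tbl, path = "data/flights_parquet")     # persist as Parquet
r_flights <- dplyr::collect(flights_tbl)                            # bring the data into R memory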
Reading & Writing from Apache Spark

From R into Spark:                sdf_copy_to, dplyr::copy_to, DBI::dbWriteTable
From the file system into Spark:  spark_read_<fmt>
From Spark into R:                sdf_collect, dplyr::collect, sdf_read_column
From Spark to the file system:    spark_write_<fmt>
Cache or reference a table:       tbl_cache, dplyr::tbl

Wrangle

Spark SQL via dplyr verbs (translates into Spark SQL statements):

my_table <- my_var %>%
  filter(Species == "setosa") %>%
  sample_n(10)

Direct Spark SQL commands:

my_table <- DBI::dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
DBI::dbGetQuery(conn, statement)

Scala API via SDF functions:

sdf_mutate(.data)
  Works like the dplyr mutate function.
sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))
  Partitions a Spark DataFrame, e.g. sdf_partition(x, training = 0.5, test = 0.5).
sdf_register(x, name = NULL)
  Gives a Spark DataFrame a table name.
sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
sdf_sort(x, columns)
  Sorts by one or more columns in ascending order.
sdf_with_unique_id(x, id = "id")
  Adds a unique ID column.
sdf_predict(object, newdata)
  Returns a Spark DataFrame with predicted values.
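A sketch combining the dplyr verbs and SDF functions above: sample the wrangled table, register it under a name, and cache it. The table names are illustrative.

setosa_sample <- my_table %>%
  sdf_sample(fraction = 0.5, replacement = FALSE) %>%   # down-sample inside Spark
  sdf_register("spark_setosa_sample")                   # give the result a table name
tbl_cache(sc, "spark_setosa_sample")                    # load it into Spark memory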
ML Transformers

ft_binarizer(my_table, input.col = "Petal_Length", output.col = "petal_large", threshold = 1.2)

Arguments that apply to all functions: x, input.col = NULL, output.col = NULL

ft_binarizer(threshold = 0.5)
  Assigns values based on a threshold.
ft_bucketizer(splits)
  Numeric column to discretized column.
ft_discrete_cosine_transform(inverse = FALSE)
  Time domain to frequency domain.
ft_elementwise_product(scaling.col)
  Element-wise product between two columns.
ft_index_to_string()
  Index labels back to labels as strings.
ft_one_hot_encoder()
  Continuous to binary vectors.
ft_quantile_discretizer(n.buckets = 5L)
  Continuous to binned categorical values.
ft_sql_transformer(sql)
ft_string_indexer(params = NULL)
  Column of labels into a column of label indices.
ft_vector_assembler()
  Combines vectors into a single row-vector.

Extensions

Create an R package that calls the full Spark API and provides interfaces to Spark packages.

Core types:
spark_connection()   Connection between R and the Spark shell process.
spark_jobj()         Instance of a remote Spark object.
spark_dataframe()    Instance of a remote Spark DataFrame object.

Call Spark from R:
invoke()             Call a method on a Java object.
invoke_new()         Create a new object by invoking a constructor.
invoke_static()      Call a static method on an object.

Machine learning extensions:
ml_create_dummy_variables(), ml_options(), ml_prepare_dataframe(), ml_model(), ml_prepare_response_features_intercept()

Model (MLlib)

ml_decision_tree(my_table, response = "Species", features = c("Petal_Length", "Petal_Width"))

ml_als_factorization(x, rating.column = "rating", user.column = "user", item.column = "item", rank = 10L, regularization.parameter = 0.1, iter.max = 10L, ml.options = ml_options())
ml_decision_tree(x, response, features, max.bins = 32L, max.depth = 5L, type = c("auto", "regression", "classification"), ml.options = ml_options())
  Same options for: ml_gradient_boosted_trees
ml_generalized_linear_regression(x, response, features, intercept = TRUE, family = gaussian(link = "identity"), iter.max = 100L, ml.options = ml_options())
ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x), compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha = (50/k) + 1, beta = 0.1 + 1, ml.options = ml_options())
ml_linear_regression(x, response, features, intercept = TRUE, alpha = 0, lambda = 0, iter.max = 100L, ml.options = ml_options())
  Same options for: ml_logistic_regression
ml_multilayer_perceptron(x, response, features, layers, iter.max = 100, seed = sample(.Machine$integer.max, 1), ml.options = ml_options())
ml_naive_bayes(x, response, features, lambda = 0, ml.options = ml_options())
ml_one_vs_rest(x, classifier, response, features, ml.options = ml_options())
ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
ml_random_forest(x, response, features, max.bins = 32L, max.depth = 5L, num.trees = 20L, type = c("auto", "regression", "classification"), ml.options = ml_options())
ml_survival_regression(x, response, features, intercept = TRUE, censor = "censor", iter.max = 100L, ml.options = ml_options())
ml_binary_classification_eval(predicted_tbl_spark, label, score, metric = "areaUnderROC")
ml_classification_eval(predicted_tbl_spark, label, predicted_lbl, metric = "f1")
ml_tree_feature_importance(sc, model)
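A short sketch of one of the MLlib wrappers listed above, run against the "spark_iris" table used earlier in this sheet; the cluster count is illustrative.

kmeans_model <- tbl(sc, "spark_iris") %>%
  select(Petal_Length, Petal_Width) %>%     # features stay in Spark
  ml_kmeans(centers = 3)                    # fit k-means in Spark MLlib
kmeans_model                                # printing shows the fitted cluster centers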
RStudio is a trademark of RStudio, Inc. CC BY RStudio [email protected] 844-448-1212 rstudio.com Learn more at spark.rstudio.com package version 0.5 Updated: 12/21/16
