
Data Science in Spark with sparklyr
Cheat Sheet

Intro

sparklyr is an R interface for Apache Spark. It provides a complete dplyr backend and the option to query directly using Spark SQL statements. With sparklyr, you can orchestrate distributed machine learning using either Spark's MLlib or H2O Sparkling Water.

Starting with version 1.044, RStudio Desktop, Server and Pro include integrated support for the sparklyr package. You can create and manage connections to Spark clusters and local Spark instances from inside the IDE: open the connection log, disconnect, open the Spark UI, and preview the first 1K rows of Spark and Hive tables.

Data Science Toolchain with Spark + sparklyr

Import: export an R DataFrame to Spark, read a file, or read an existing Hive table.
Tidy and Transform (Wrangle): dplyr verbs, direct Spark SQL (DBI), SDF functions (Scala API), and Transformer functions.
Model: Spark MLlib or the H2O extension.
Visualize and Communicate: collect data into R for plotting; share plots, documents, and apps.
(Toolchain based on R for Data Science, Grolemund & Wickham.)
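A minimal sketch of the two query paths described above. It assumes an open connection named sc and an iris table already copied to Spark as "spark_iris"; both steps are shown later in this sheet.

library(dplyr)

# dplyr backend: verbs are translated to Spark SQL and run inside Spark
tbl(sc, "spark_iris") %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length))

# Direct Spark SQL via DBI
DBI::dbGetQuery(sc, "SELECT Species, COUNT(*) AS n FROM spark_iris GROUP BY Species")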

Getting started

Local Mode (easy setup; no cluster required)
1. Install a local version of Spark: spark_install("2.0.1")
2. Open a connection: sc <- spark_connect(master = "local")

On a YARN Managed Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes, preferably an edge node.
2. Locate the path to the cluster's Spark Home directory; it is normally /usr/lib/spark.
3. Open a connection:
   spark_connect(master = "yarn-client", version = "1.6.2", spark_home = [Cluster's Spark path])

On a Mesos Managed Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes.
2. Locate the path to the cluster's Spark directory.
3. Open a connection:
   spark_connect(master = [mesos URL], version = "1.6.2", spark_home = [Cluster's Spark path])

On a Spark Standalone Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes or a server in the same LAN.
2. Install a local version of Spark: spark_install(version = "2.0.1")
3. Open a connection:
   spark_connect(master = "spark://host:port", version = "2.0.1", spark_home = spark_home_dir())

Using Livy (Experimental)
1. The Livy REST application should be running on the cluster.
2. Connect to the cluster:
   sc <- spark_connect(master = "http://host:port", method = "livy")
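A few optional helpers when getting set up; this is a sketch, not part of the original sheet, and assumes the connection object is named sc as above.

spark_available_versions()   # Spark versions that spark_install() can download
spark_installed_versions()   # Versions already installed locally
spark_version(sc)            # Version used by an open connection
src_tbls(sc)                 # Tables registered in the Spark session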
Using sparklyr

A brief example of a data analysis using Apache Spark, R and sparklyr in local mode:

library(sparklyr); library(dplyr); library(ggplot2); library(tidyr)
set.seed(100)

spark_install("2.0.1")                            # Install Spark locally
sc <- spark_connect(master = "local")             # Connect to the local version

import_iris <- copy_to(sc, iris, "spark_iris",    # Copy data into Spark memory
                       overwrite = TRUE)

partition_iris <- sdf_partition(                  # Partition the data
  import_iris, training = 0.5, testing = 0.5)

sdf_register(partition_iris,                      # Create Hive metadata for each partition
  c("spark_iris_training", "spark_iris_test"))

tidy_iris <- tbl(sc, "spark_iris_training") %>%   # Wrangle
  select(Species, Petal_Length, Petal_Width)

model_iris <- tidy_iris %>%                       # Spark ML decision tree model
  ml_decision_tree(response = "Species",
                   features = c("Petal_Length", "Petal_Width"))

test_iris <- tbl(sc, "spark_iris_test")           # Create a reference to the Spark table

pred_iris <- sdf_predict(                         # Bring data back into R memory
  model_iris, test_iris) %>%                      # for plotting
  collect()

pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
                        lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()

spark_disconnect(sc)                              # Disconnect

Cluster Deployment

Cluster Deployment Options:
Managed Cluster: Driver Node -> Cluster Manager (YARN or Mesos) -> Worker Nodes
Stand Alone Cluster: Driver Node -> Worker Nodes

Tuning Spark

Example Configuration:

config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client",
                    config = config, version = "2.0.1")

Important Tuning Parameters (with defaults):
spark.yarn.am.cores
spark.yarn.am.memory                512m
sparklyr.shell.executor-memory
sparklyr.shell.driver-memory
spark.executor.heartbeatInterval    10s
spark.network.timeout               120s
spark.executor.memory               1g
spark.executor.cores                1
spark.executor.extraJavaOptions
spark.executor.instances
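A hedged sketch of setting the parameters listed above through spark_config(); the values are illustrative, not recommendations.

config <- spark_config()
config$spark.executor.instances <- 4                # number of executors
config$spark.executor.heartbeatInterval <- "30s"
config$spark.network.timeout <- "300s"
config$`sparklyr.shell.driver-memory` <- "2G"        # backticks needed for hyphenated names
sc <- spark_connect(master = "yarn-client", config = config, version = "2.0.1")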
Import

Copy a DataFrame into Spark:
sdf_copy_to(sc, iris, "spark_iris")
DBI::dbWriteTable(sc, "spark_iris", iris)

sdf_copy_to(sc, x, name, memory, repartition, overwrite)
DBI::dbWriteTable(conn, name, value)

Import into Spark from a file. Arguments that apply to all functions:
sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE

CSV       spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON      spark_read_json(mode = NULL)
PARQUET   spark_read_parquet(mode = NULL)

From a table in Hive:
my_var <- tbl_cache(sc, name = "hive_iris")
  tbl_cache(sc, name, force = TRUE) loads the table into memory.
my_var <- dplyr::tbl(sc, name = "hive_iris")
  dplyr::tbl(sc, ...) creates a reference to the table without loading it into memory.

Visualize & Communicate

Download data to R memory:
r_table <- collect(my_table)
plot(Petal_Width ~ Petal_Length, data = r_table)

dplyr::collect(x)
  Downloads a Spark DataFrame to an R DataFrame.
sdf_read_column(x, column)
  Returns the contents of a single column to R.

Save from Spark to the file system. Arguments that apply to all functions: x, path

CSV       spark_write_csv(header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON      spark_write_json(mode = NULL)
PARQUET   spark_write_parquet(mode = NULL)
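A sketch of pairing the readers and writers above; the file paths and table name are placeholders, not from the original sheet.

flights_tbl <- spark_read_csv(sc, name = "flights", path = "data/flights.csv",
                              header = TRUE, infer_schema = TRUE)   # read CSV into Spark
spark_write_parquet(flights_tbl, path = "data/flights_parquet")     # persist as Parquet
r_flights <- dplyr::collect(flights_tbl)                            # bring the data into R memory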
Reading & Writing from Apache Spark

From R into Spark:                sdf_copy_to, dplyr::copy_to, DBI::dbWriteTable
From the file system into Spark:  spark_read_<fmt>
From Spark into R:                sdf_collect, dplyr::collect, sdf_read_column
From Spark to the file system:    spark_write_<fmt>
Cache or reference a table:       tbl_cache, dplyr::tbl

Wrangle

Spark SQL via dplyr verbs (translates into Spark SQL statements):

my_table <- my_var %>%
  filter(Species == "setosa") %>%
  sample_n(10)

Direct Spark SQL commands:

my_table <- DBI::dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
DBI::dbGetQuery(conn, statement)

Scala API via SDF functions:

sdf_mutate(.data)
  Works like the dplyr mutate function.
sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))
  Partitions a Spark DataFrame, e.g. sdf_partition(x, training = 0.5, test = 0.5).
sdf_register(x, name = NULL)
  Gives a Spark DataFrame a table name.
sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
sdf_sort(x, columns)
  Sorts by one or more columns in ascending order.
sdf_with_unique_id(x, id = "id")
  Adds a unique ID column.
sdf_predict(object, newdata)
  Returns a Spark DataFrame with predicted values.
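A sketch combining the dplyr verbs and SDF functions above: sample the wrangled table, register it under a name, and cache it. The table names are illustrative.

setosa_sample <- my_table %>%
  sdf_sample(fraction = 0.5, replacement = FALSE) %>%   # down-sample inside Spark
  sdf_register("spark_setosa_sample")                   # give the result a table name
tbl_cache(sc, "spark_setosa_sample")                    # load it into Spark memory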
ML Transformers

ft_binarizer(my_table, input.col = "Petal_Length", output.col = "petal_large", threshold = 1.2)

Arguments that apply to all functions: x, input.col = NULL, output.col = NULL

ft_binarizer(threshold = 0.5)
  Assigns values based on a threshold.
ft_bucketizer(splits)
  Numeric column to discretized column.
ft_discrete_cosine_transform(inverse = FALSE)
  Time domain to frequency domain.
ft_elementwise_product(scaling.col)
  Element-wise product between two columns.
ft_index_to_string()
  Index labels back to labels as strings.
ft_one_hot_encoder()
  Continuous to binary vectors.
ft_quantile_discretizer(n.buckets = 5L)
  Continuous to binned categorical values.
ft_sql_transformer(sql)
ft_string_indexer(params = NULL)
  Column of labels into a column of label indices.
ft_vector_assembler()
  Combines vectors into a single row-vector.

Extensions

Create an R package that calls the full Spark API and provides interfaces to Spark packages.

Core types:
spark_connection()   Connection between R and the Spark shell process.
spark_jobj()         Instance of a remote Spark object.
spark_dataframe()    Instance of a remote Spark DataFrame object.

Call Spark from R:
invoke()             Call a method on a Java object.
invoke_new()         Create a new object by invoking a constructor.
invoke_static()      Call a static method on an object.

Machine learning extensions:
ml_create_dummy_variables(), ml_options(), ml_prepare_dataframe(), ml_model(), ml_prepare_response_features_intercept()

Model (MLlib)

ml_decision_tree(my_table, response = "Species", features = c("Petal_Length", "Petal_Width"))

ml_als_factorization(x, rating.column = "rating", user.column = "user", item.column = "item", rank = 10L, regularization.parameter = 0.1, iter.max = 10L, ml.options = ml_options())
ml_decision_tree(x, response, features, max.bins = 32L, max.depth = 5L, type = c("auto", "regression", "classification"), ml.options = ml_options())
  Same options for: ml_gradient_boosted_trees
ml_generalized_linear_regression(x, response, features, intercept = TRUE, family = gaussian(link = "identity"), iter.max = 100L, ml.options = ml_options())
ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x), compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha = (50/k) + 1, beta = 0.1 + 1, ml.options = ml_options())
ml_linear_regression(x, response, features, intercept = TRUE, alpha = 0, lambda = 0, iter.max = 100L, ml.options = ml_options())
  Same options for: ml_logistic_regression
ml_multilayer_perceptron(x, response, features, layers, iter.max = 100, seed = sample(.Machine$integer.max, 1), ml.options = ml_options())
ml_naive_bayes(x, response, features, lambda = 0, ml.options = ml_options())
ml_one_vs_rest(x, classifier, response, features, ml.options = ml_options())
ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
ml_random_forest(x, response, features, max.bins = 32L, max.depth = 5L, num.trees = 20L, type = c("auto", "regression", "classification"), ml.options = ml_options())
ml_survival_regression(x, response, features, intercept = TRUE, censor = "censor", iter.max = 100L, ml.options = ml_options())
ml_binary_classification_eval(predicted_tbl_spark, label, score, metric = "areaUnderROC")
ml_classification_eval(predicted_tbl_spark, label, predicted_lbl, metric = "f1")
ml_tree_feature_importance(sc, model)
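A short sketch of one of the MLlib wrappers listed above, run against the "spark_iris" table used earlier in this sheet; the cluster count is illustrative.

kmeans_model <- tbl(sc, "spark_iris") %>%
  select(Petal_Length, Petal_Width) %>%     # features stay in Spark
  ml_kmeans(centers = 3)                    # fit k-means in Spark MLlib
kmeans_model                                # printing shows the fitted cluster centers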
RStudio is a trademark of RStudio, Inc. CC BY RStudio [email protected] 844-448-1212 rstudio.com Learn more at spark.rstudio.com package version 0.5 Updated: 12/21/16
