
01-DS320-v67-Course Introduction PDF

Uploaded by

Đức Nguyễn

DS320 DataStax Enterprise Analytics with Apache Spark™


Introduction

© 2019 DataStax. 1
Use only with permission. academy.datastax.com
Analysis Steps
• Statistical Analysis
• Classification
• Clustering
• Regression
• Similarity Matching
• Collaborative Filtering
• Profiling
• Dimensionality Reduction
• Feature Extraction

DSE Analytics Overview
• DSE Analytics
• Based on Apache Spark™
• Benefits of DSE Analytics
• Always on
• Workload isolation
• Faster operational analytics than open source
• Create personalized experiences
• Deliver real-time insights
• Process data to gain a 360-degree view of the customer

Why Analytics?
• Data tells the story behind the operations
• Data has meaning
• Data is important to the business
• The business can then make decisions based on the information
• Ask yourself the following question:
• How does the meaning of your company's data help your company?

Apache Cassandra™ vs. DSE Analytics
• Apache Cassandra™ queries on the partition key and clustering columns
• Models are not built around relations nor around objects
• Models are built around your queries
• Queries focus on OLTP performance
• DSE Analytics (Spark) frees users to query on any field using familiar RDBMS patterns
• This includes aggregates, joins, group by, etc.
• Plain CQL queries alone cannot express these operations
• DSE Analytics provides an interface for querying the data

What's an Analytical Query?
• Examples of Analytical Queries:
• Number of videos viewed by each user
• Number of videos viewed by each user in each genre
• Average rating per video
• Most popular videos
• Trending videos
• Video recommendations
• In Summary:
• Analytical queries are queries that can affect business decisions and customers' purchasing decisions
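To make two of the example queries above concrete, here is a minimal sketch in plain Python that computes the number of views per user and the average rating per video. The record layout and the sample values are assumptions made up for illustration; in DSE Analytics the same aggregations would run distributed across the cluster rather than in one process.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (user, video) view events and (video, rating) pairs.
views = [("ann", "v1"), ("bob", "v1"), ("ann", "v2"), ("ann", "v1")]
ratings = [("v1", 4), ("v1", 2), ("v2", 5)]

def views_per_user(events):
    """Count the view events recorded for each user."""
    return dict(Counter(user for user, _video in events))

def average_rating_per_video(pairs):
    """Compute the mean rating for each video."""
    totals = defaultdict(lambda: [0, 0])  # video -> [sum of ratings, count]
    for video, rating in pairs:
        totals[video][0] += rating
        totals[video][1] += 1
    return {video: total / count for video, (total, count) in totals.items()}

print(views_per_user(views))
print(average_rating_per_video(ratings))
```

Both functions are a grouping step followed by an aggregation, which is the shape most of the listed analytical queries share.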

Using the Driver?
• Why not use one of the drivers and perform these queries on the application side?
• Involves significant "plumbing code"
• Requires querying the appropriate tables with CQL
• Cons of pulling all data into a single application:
• Network cost
• Memory limitations
• Limited concurrency

Using DSE Analytics
• SQL Syntax
• Developers are familiar with SQL
• Queries the tables directly using succinct Scala syntax
• More declarative
• Other APIs are available
• Java, Python, R
• DSE Analytics runs only in individual datacenters and uses multiple machines to compute the results
• Data locality
• Failure tolerance
• Checkpointing

Word Counting in a Distributed Environment
[Figure: word count as a pipeline of six phases: Input, Splitting, Mapping, Shuffling, Reducing, Final Result]
• Input: Alpha Bravo Charlie / Delta Delta Charlie / Alpha Delta Bravo
• Splitting (K1, V1): one split per line
• Alpha Bravo Charlie
• Delta Delta Charlie
• Alpha Delta Bravo
• Mapping, List(K2, V2):
• Alpha, 1; Bravo, 1; Charlie, 1
• Delta, 1; Delta, 1; Charlie, 1
• Alpha, 1; Delta, 1; Bravo, 1
• Shuffling, K2, List(V2): Bravo, (1,1); Delta, (1,1,1); Alpha, (1,1); Charlie, (1,1)
• Reducing, List(K3, V3): Bravo, 2; Delta, 3; Alpha, 2; Charlie, 2
• Final Result: Bravo, 2; Delta, 3; Alpha, 2; Charlie, 2

Why Was Map-Reduce so Revolutionary?
• Most analytical queries can be broken down into mapping and reducing
• Map-reduce is a generic process
• It spreads data among several machines to achieve parallelism
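The mapping, shuffling, and reducing phases can be sketched in a few lines of plain Python. This is a single-process illustration under the assumption that each input line is one split; the function names are made up for this sketch, and a real engine would run the phases on different machines.

```python
from collections import defaultdict

def map_phase(line):
    """Mapping: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffling: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducing: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    """Run map, shuffle, and reduce over all splits."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(mapped))

splits = ["Alpha Bravo Charlie", "Delta Delta Charlie", "Alpha Delta Bravo"]
print(word_count(splits))
```

Only the shuffle needs to see pairs from every split; the map and reduce steps work on independent pieces, which is what makes the process easy to parallelize.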

Verbose Code in a Hadoop Implementation
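import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Step 1: write a Mapper that emits (word, 1) for every token in a line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Step 2: write a Reducer that sums the counts collected for each word
  // (the class declaration was cut off on the slide; this follows the
  // standard Hadoop WordCount example)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  // The job setup (main method) is not shown on the slide
}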

Spark Map Reduce
• Fluent API

val counts = words
  .map(word => (word, 1))                 // emit (word, 1) for every word
  .reduceByKey { case (x, y) => x + y }   // sum the counts per word

Exercise 01.01: Setting Up the Lab Environment
• In this exercise, you will:
• Set up the virtual machine for this course
• Familiarize yourself with cqlsh

Apache Spark™ History

Apache Spark™ History (1 of 2)
• Started at UC Berkeley in 2009
• Open sourced in 2010 under BSD license
• Written in Scala, a common language at the University
• Spark was written by students working on a resource manager, and was designed to "Spark" interest in Mesos
• In 2013 Spark was donated to the Apache Software Foundation
• 1000+ contributors in 2019
• Several supported formats and data adapters
• Hive, Avro, Parquet, ORC, JSON, and JDBC
• Able to run on different resource managers and work in a batch or streaming fashion
Apache Spark™ History (2 of 2)
• Developed to address limitations of MapReduce
• Uses memory as much as possible
• Attempts to escape the rigidity of the MapReduce paradigm
• Latest version is 2.4.3 (September 2019)
• Latest DSE Analytics is based on Spark 2.2.3
• Provides major speed enhancements as of the following versions:
• v1.6 DataFrames
• v2.0 DataSets
• Easy to code
• Provides some real-time analytics

Apache Spark™ is Young
• Quick development
• Rapid version changes
• API is updated rapidly
• Still being innovated
• DSE Analytics remains a few releases behind the latest version
• DataStax emphasizes stability over "bleeding edge" releases

DSE and Apache Spark™
• DSE allows users to submit jobs to any node
• Knowing which node is master is not required
• Spark in DSE Analytics attempts to schedule tasks on the same node that owns the data
• Spark master and worker run in the same JVM as DSE
• Two (2) components are at work here:
• The connector used to access data in Apache Cassandra™
• DSE Analytics, which is the customized version of Apache Spark™

Improvements
• Within the connector there are numerous Apache Cassandra™ and DSE-specific enhancements that increase performance
• Those familiar with Spark already will note:
• DSE uses the standalone resource manager, but it has been enhanced with failover and persistent state stored in Apache Cassandra™

Apache Cassandra™ Architecture

[Figure: Apache Cassandra™ ring of C* nodes covering the token range -2^63 to +2^63-1; each client sends transactions through its driver to nodes in the ring]

Spark Architecture

[Figure: a Client hosting the Spark Driver (SparkContext), the Master, and Workers that each host Executors; numbered arrows mark the four steps below]
1. Request for computational resources
2. Allocate computational resources
3. Start Executors
4. Schedule and perform computation

DSE Integration
[Figure: three DSE nodes, each running C* alongside a Spark Worker with two Executors and the Spark-Cassandra Connector; one node also hosts the Master]

Jobs, Stages, Tasks, DAG

DAG
• Directed Acyclic Graph (DAG)
• A DAG is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations to be applied to the RDDs
• In a Spark DAG, every edge points from earlier to later in the sequence
• These operations prompt the Driver to create a DAG, which is then submitted to Spark
• Stages and Tasks
• Stages optimize several transformations together
• Shuffle step is the barrier
• TaskScheduler assigns tasks
• Executors execute tasks
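The idea that a shuffle acts as the barrier between stages can be illustrated with a small sketch. It is a deliberate simplification and not Spark's actual scheduler: operations are plain strings, the set of shuffle-requiring ("wide") operations is a hand-picked assumption, and a wide operation is simply treated as the start of a new stage.

```python
# Assumed set of operations that need a shuffle ("wide" operations).
WIDE = {"reduceByKey", "groupByKey", "join", "sortByKey"}

def split_into_stages(operations):
    """Group consecutive operations into stages; a wide operation starts a new stage."""
    stages = [[]]
    for op in operations:
        if op in WIDE and stages[-1]:
            stages.append([])  # shuffle barrier: close the current stage
        stages[-1].append(op)
    return stages

pipeline = ["flatMap", "map", "reduceByKey", "filter", "sortByKey"]
for number, stage in enumerate(split_into_stages(pipeline)):
    print(number, stage)
```

Operations inside one stage need no data exchange between nodes, which is why they can be pipelined together within a single task.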

TaskScheduler
• The TaskScheduler performs the following functions:
• Responsible for sending the tasks to the cluster
• Handles the running tasks
• Retries the tasks if there are failures
• Mitigates any stragglers
• The Worker creates an executor JVM, but not for any specific task
• The Driver requests the executors
• The Master tells the workers to boot them
• Then, the executors report directly to the Driver
• All coordination of tasks happens between the Driver and the Executor; no workers are involved
• View the DAG and monitor its operation through the Spark UI
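The "retries the tasks if there are failures" responsibility can be sketched as a simple retry loop. This is only an illustration of the idea, not the TaskScheduler's implementation: the function names, the retry limit of 3, and the simulated failures are all assumptions made up for this example.

```python
def run_with_retries(tasks, max_attempts=3):
    """Run each task, retrying on failure up to max_attempts times.

    tasks maps a task id to a zero-argument callable. Returns a dict of
    task id -> (result, attempts used); re-raises if a task never succeeds.
    """
    results = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results[task_id] = (task(), attempt)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up: the task failed on every attempt
    return results

def flaky(failures_before_success, value):
    """Build a hypothetical task that fails a fixed number of times first."""
    state = {"remaining": failures_before_success}

    def task():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated executor failure")
        return value

    return task

outcome = run_with_retries({"t0": flaky(0, 10), "t1": flaky(2, 20)})
print(outcome)
```

A real scheduler would also resubmit the retried task to a different executor where possible and launch speculative copies of stragglers; neither is modeled here.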

Where Does DSEFS Fit?
DataStax Enterprise File System

• DSEFS works with Spark in several ways


• As a shared file system DSEFS can be used for the following:
• Checkpointing;
• Loading .jar files containing jobs, so the executors do not load them on the local file system or have them be sent by the Driver;
• Storing intermediate files (Parquet is a good choice) with fault tolerance; and
• Storing end results of some type of jobs, e.g. reports.
• Archiving data in DSEFS is a normal practice
• DSEFS is an HDFS-compatible file system

DSE Analytics—Search Integration
• Utilize search indexes on tables having such indexes
• Node must be SearchAnalytics enabled
• Use with the RDD API, DataFrame API, or Spark SQL API
• SparkSQL also has automatic recognition of Search indexing enabled
• Example using CQL-based Search predicates without the need for solr_query:

SELECT id, artist_name FROM music.solr WHERE artist_name LIKE 'Miles%' LIMIT 10

DSE Analytics—Use Cases
• Batch
• Streaming
• Data integrity
• Verify that all denormalized data copies are in sync
• ETL
• DSE Analytics simplifies this immensely
• Hint: Spark SQL
• Machine learning
• ODBC/JDBC connectivity

Spark SQL Thrift Server / AlwaysOn SQL
• Spark SQL Thrift Server is now branded as AlwaysOn SQL Server in DSE
• The new Simba driver also has the capability to connect to any node in the cluster and has fault tolerance
• Don't confuse the AlwaysOn SQL Server with the deprecated Apache Cassandra™ Thrift protocol
• The Thrift server still exists in Spark and handles ODBC/JDBC
• https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html
• DataStax took the Thrift server to the next level, rebranding it as AlwaysOn SQL Server (AOSS), adding fault tolerance and caching, and working with Simba to confirm that drivers function with the new features
• AlwaysOn SQL Server handles JDBC calls and can do both reads and writes
• AlwaysOn SQL Server can read data from relational databases and write it to Apache Cassandra™
• It can also read data from any Spark-compatible data source and write to Apache Cassandra™

Apache Spark™ Streaming
• Apache Spark™ Streaming is another "long running" Spark application
• Processes data over time windows
• Apache Spark™ Streaming is commonly associated with the Spark application
• DStream takes a parameter that determines how large or small the micro-batches are
• Example: Every two seconds it takes all the received data and "does something"
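The micro-batch idea can be sketched without Spark: events are bucketed by arrival time into fixed-length intervals, and each bucket is then processed as one small batch. The function name, the event format, and the sample timestamps below are assumptions for illustration only.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, payload) events into fixed-length batches.

    Batch k holds the events with
    k * batch_seconds <= timestamp < (k + 1) * batch_seconds.
    Intervals that received no events are skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        batches[int(timestamp // batch_seconds)].append(payload)
    return [batches[k] for k in sorted(batches)]

# Hypothetical events: (arrival time in seconds, payload)
events = [(0.1, "a"), (1.9, "b"), (2.0, "c"), (3.5, "d"), (4.2, "e")]
for batch in micro_batches(events, batch_seconds=2):
    print(len(batch), batch)  # "does something" with each batch
```

A smaller batch interval lowers latency but produces more, smaller batches to schedule; that trade-off is what the DStream parameter controls.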

DSE Analytics Configuration

Configuration
/etc/default/dse
# Enable the DSE Graph service on this node
GRAPH_ENABLED=0

# Start the node in DSE Search mode
SOLR_ENABLED=0

# Start the node in Spark mode
SPARK_ENABLED=1

Configuration—dse.yaml
• Found in either:
• /etc/dse/dse.yaml
• <install_dir>/resources/dse/conf/dse.yaml
• Configure additional Spark settings here:
• Spark cluster and application statistics being collected
• Initial Spark worker resources (as a percentage after C*)
• Spark security and encryption
• Multiple Hadoop options
• Workpools and Always On SQL Server (AOSS)
• Spark readiness check
• DSEFS (and its configuration)
• Spark Auditing (more as C* but spark-sql audit shows up)
• Hive settings

Configuration—dse-spark-env.sh
• Found in either:
• /etc/dse/spark/dse-spark-env.sh
• <install_dir>/resources/spark/conf/dse-spark-env.sh
• Mainly for defaults regarding running Spark with DSE
• Most environment changes are done in spark-env.sh

Configuration—spark-env.sh
• Found in either:
• /etc/dse/spark/spark-env.sh
• <install_dir>/resources/spark/conf/spark-env.sh
• Permits the setting of default CORES and MEMORY across the following:
• Workers
• Executors
• Master
• Driver
• Ability to fine-tune memory and processors rather than the generic % found in the dse.yaml file

Configuration – spark-defaults.conf
• Found in the usual Spark configuration locations
• Allows you to pass in default Spark properties
• If using encryption, specify settings here
• Note this only affects applications which are running on the node with the file, and only affects applications run through "dse spark-submit"
• Ability to use a different file to set defaults for various apps
• dse spark-submit --properties-file new-properties-file
• There can be only one; if a property is set in spark-defaults.conf but you also pass in a new file, the value in spark-defaults.conf is ignored, but the file itself is not ignored completely
• In a properties file, whitespace or = separates a property from its value
• A default is just that; the default is used for the majority of applications
• A secondary properties file can be set per application
• Any property can also be individually configured within the application
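The layering described above (file defaults, then a per-application properties file, then settings made inside the application) can be sketched as a dictionary merge. This is an illustration of the behavior as the slide describes it, not DSE's own loader; the parser, the function names, and the property values are assumptions, and the exact precedence may differ between Spark and DSE versions.

```python
import re

# A property name, then either "=" or whitespace, then the value.
_LINE = re.compile(r"^([^\s=]+)\s*(?:=|\s)\s*(.*)$")

def parse_properties(text):
    """Parse 'key value' or 'key=value' lines; skip blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match:
            props[match.group(1)] = match.group(2)
    return props

def effective_config(defaults, properties_file=None, app_settings=None):
    """Later sources win: defaults < properties file < set in the application."""
    merged = dict(defaults)
    merged.update(properties_file or {})
    merged.update(app_settings or {})
    return merged

defaults = parse_properties("""
# hypothetical contents of spark-defaults.conf
spark.executor.memory 2g
spark.cores.max=4
""")
override = parse_properties("spark.executor.memory = 4g")
print(effective_config(defaults, override, {"spark.cores.max": "8"}))
```

Only the properties that are redefined change; everything else in the defaults file still applies, which matches the "but not the file completely" remark.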
Exercise 01.02: Working with Configuration Files

In this exercise, you will:


• Open and explore DSE Analytics configuration files

Apache Spark™ Basics

Which Three Years Have the Most Videos?

CREATE TABLE videos (
  video_id TIMEUUID,
  avg_rating FLOAT,
  description TEXT,
  genres SET<TEXT>,
  mpaa_rating TEXT,
  release_date TIMESTAMP,
  release_year INT,
  title TEXT,
  user_id UUID,
  PRIMARY KEY (video_id)
);

• Problem: Can only query this table on video_id
One (Inferior) Solution
• Write an application that connects via the Apache Cassandra™ Driver
• Processes all the data in the application
• Requires moving all the data to the application
• Places pressure on one (1) machine instead of distributing the workload throughout the cluster

Spark SQL
• Yes, you read that correctly; SQL on top of the Cassandra tables
• OLAP

ubuntu@ds320-node1:~$ dse spark-sql
The log file is at /home/ubuntu/.spark-sql-shell.log
spark-sql>

• DataStax Enterprise makes this distributed architecture simple

Which Three Years Have the Most Videos?

spark-sql> SELECT release_year, count(*) as num_videos
         > FROM killrvideo.videos
         > GROUP BY release_year
         > ORDER BY num_videos DESC
         > LIMIT 3;
2015 446
2009 295
2011 288
Time taken: 6.093 seconds, Fetched 3 row(s)
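For readers less familiar with SQL, the sketch below shows what the GROUP BY, ORDER BY, and LIMIT clauses compute, using plain Python on a few made-up rows. The rows are hypothetical stand-ins for killrvideo.videos, and running it this way in one process is exactly the "inferior solution" pattern; it is shown only to explain the query's logic.

```python
from collections import Counter

def top_release_years(videos, n=3):
    """Count videos per release_year and return the n most frequent years."""
    counts = Counter(video["release_year"] for video in videos)
    return counts.most_common(n)

# Hypothetical rows standing in for killrvideo.videos.
videos = (
    [{"release_year": 2015}] * 4
    + [{"release_year": 2009}] * 3
    + [{"release_year": 2011}] * 2
    + [{"release_year": 1999}]
)
print(top_release_years(videos))
```

Spark SQL performs the same count per key, but each node counts its own partitions first and only the partial counts are shuffled.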

Hive Query Language
• Apache Spark no longer uses the Hive Query Language
• Spark now has its own SQL parser
• It is a superset of the Hive Query Language
• It is also ANSI SQL compatible
• Minor syntactic differences
• For example: TOP 3 vs. LIMIT 3

Exiting Spark SQL Shell
• Type exit;

Exercise 01.03: Write SQL Queries

In this exercise, you will:


• Set up environment
• Run SQL Queries

Apache Spark™ REPL (Read-Eval-Print Loop)

REPL
• Read-Evaluate-Print Loop (REPL)
• Terminal for Apache Spark™ commands
• Uses Scala
• Nothing to fear here
• Scala runs on the JVM and compiles into Java class files
• More terse
• Functional
• Start the REPL by typing the following:
dse spark

DSE Spark
ubuntu@ds320-node1:~$ dse spark
The log file is at /home/ubuntu/.spark-shell.log
warning: there was one deprecation warning; re-run with -deprecation for details
New Spark Session
WARN 2019-05-09 21:55:52,448 org.apache.spark.SparkContext: Use an existing SparkContext, some
configuration may not take effect.
Extracting Spark Context
Extracting SqlContext
Spark context Web UI available at http://107.23.178.22:4040
Spark context available as 'sc' (master = dse://?, app id = app-20170509215551-0053).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.3
/_/
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
SparkSession
• REPL automatically sets up a variable named spark
• Instance of a SparkSession
• SparkSession is the entry point for all things Spark

Spark SQL
• Run SQL commands via spark.sql()
• SQL statement does not require a semicolon because it is no longer in a Spark SQL shell

spark.sql("""
SELECT release_year, count(*) as num_videos
FROM killrvideo.videos
GROUP BY release_year
ORDER BY num_videos DESC
LIMIT 3""")

res1

• The REPL makes a variable for every line; you just don't see it when the result is a Unit:

scala> 1 + 1
res1: Int = 2
scala> (1 + 1).asInstanceOf[Unit]
scala> res2.getClass // See, still defined
res3: Class[Unit] = void // Note that here it is res3
// res2 remains silent

Exiting the REPL
• Type :quit
• Or press Control-D

scala> :quit
ubuntu@ds320-node1:~$

Exercise 01.04: Using REPL and spark.sql()
• In this exercise, you will:
• Start REPL
• Run SQL queries using spark.sql

Apache Spark™ Architecture

Master Web UI
• Open a browser
• Navigate to the following:
http://<your ip address>:7080
• Master UI

Which Node is the Master?

ubuntu@ds320-node1:~$ dse client-tool spark master-address
dse://54.193.32.55:9042?connection.local_dc=DC1;connection.host=;
ubuntu@ds320-node1:~$ dsetool ring
Address       DC   Rack   Workload       Graph  Status  State   Load       Owns  Token   Health [0,1]
172.31.17.26  DC1  rack1  Analytics(SM)  no     Up      Normal  74.45 MiB  ?     -77...  0.50
ubuntu@ds320-node1:~$ dsetool status

Spark Architecture 1000 Foot View
• Master
• Bookkeeper
• Worker
• Middle management
• Executor
• Does work
• Driver
• Your application

Master
• The Master tracks the total resources as reported by each worker
• It is the "Bookkeeper"
• One per data-center (not per cluster)
• Keeps track of where everything is running in the system
• Assigns executors to the Driver
• Commands the worker to launch the executors
• Identifies what resources are available
• CPU cores
• Memory
• Hosts its own admin UI
• Makes very few decisions
• Driver contacts master to get resources

Worker
• Each worker defines how many resources are available on the node it manages
• Job Functions:
• Middle management
• One instance per node
• Does not perform much "work"
• They do what they are told to do and keep track of their little corner of the world
• Forks executors
• Reports back to Master
• Not interesting; not important

Executor
• Job Functions:
• The workhorse
• Does what it is told to do
• Makes very few decisions
• Generally one executor per node
• Does not take directions from the worker or the master
• Directions come from the Driver
• Runs in its own JVM
• One executor per core allocated to Spark on that machine
• Ships results back to Driver
• Generally, an executor has several "cores"
• For example: 4 nodes and 20 requested cores would result in 4 (5-core) executors
• If there is only a single executor per core, then it was configured that way
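The arithmetic in the example above (4 nodes and 20 requested cores giving four 5-core executors) can be written out as a small helper. It is a sketch under the assumption of one executor per node with cores spread as evenly as possible; the function name and the handling of leftover cores are made up for illustration and do not describe the resource manager's exact placement logic.

```python
def executors_for(nodes, requested_cores):
    """Spread requested cores evenly across nodes, one executor per node.

    Returns a list with the core count of each executor. Leftover cores
    go to the first executors; nodes that would get zero cores get no executor.
    """
    base, extra = divmod(requested_cores, nodes)
    sizes = [base + (1 if index < extra else 0) for index in range(nodes)]
    return [size for size in sizes if size > 0]

print(executors_for(nodes=4, requested_cores=20))
```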
Driver (1 of 2)
• Job Functions:
• Your application – in charge
• Delegates responsibilities to the team
• "You'll do this, and you'll do that, etc."
• "Come back to me when you are finished with those pieces."
• Decides what executors do and don't do
• Driver decides how to slice and dice up the work, then assigns it out to individual executors
• Allocates tasks
• Note that data is only "collected" if a user asks for it
• By default the driver will not pull any data back to itself

Driver (2 of 2)
• Determines data locality
• Keep data on nodes where it's originally located
• Unless that node has too much data already
• Makes all the decisions
• Creates and owns a SparkContext instance
• Admin must always make certain the Driver has enough physical resources

Data Flow
• Executors contact the Driver directly
• Executors pull code from the Driver directly
• Executors pull the application jar, but all data they work with is pushed from the Driver OR pulled from other executors
• The executors cannot pull data directly from the Driver; it must be sent as part of their Task metadata

Master, Master, Who is the Master?
• Spark has a master/slave architecture
• Problem is…which node is the Master?
• DSE handles automatic leader election
• Also handles re-electing a new leader
• No need to worry about who is the Master when using DSE
• DSE looks the Master up for you
• While Spark does use a master/worker architecture, this is most accurate in the context of the driver/executors
• The master/workers are the resource manager which Spark uses to allocate resources

Spark Submission Modes
• Client mode
• Driver is external to the cluster (maybe)
• Cluster mode
• A message describing how to start the driver JVM is sent to the master
• The Master uses this description to start the application on a Worker
• Because the message does not contain the application jar, the jar path must be accessible from the workers where the JVM will be started (DSEFS, NetworkMountpoint, etc.)
• Supervisor mode
• Not a separate mode but an additional flag for cluster mode (cluster mode must still be enabled); it has the Master restart the driver JVM when it exits with a non-zero code

Scheduling—FIFO

Scheduling—Fair

Scheduling—Config
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
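The difference between the FIFO and FAIR modes named above can be seen in a toy model with a single task slot. This is a simplified sketch, not Spark's scheduler: jobs are just (name, number of tasks) pairs, FAIR is approximated as plain round-robin, and pools and weights are not modeled.

```python
from collections import deque

def fifo_order(jobs):
    """FIFO: every task of the first job runs before the next job starts."""
    return [name for name, tasks in jobs for _ in range(tasks)]

def fair_order(jobs):
    """FAIR (simplified): jobs take turns, one task at a time, round-robin."""
    queue = deque([name, tasks] for name, tasks in jobs if tasks > 0)
    order = []
    while queue:
        job = queue.popleft()
        order.append(job[0])
        job[1] -= 1
        if job[1] > 0:
            queue.append(job)  # job still has tasks: back of the line
    return order

# Hypothetical workload: a long job submitted just before a short one.
jobs = [("long", 4), ("short", 2)]
print(fifo_order(jobs))
print(fair_order(jobs))
```

Under FIFO the short job waits for all four tasks of the long job; under the round-robin approximation it finishes after the fourth slot, which is why fair scheduling suits shared, interactive clusters.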

Exercise 01.05: Master Web UI
In this exercise, you will:
• Familiarize yourself with Master UI

New in Spark 2.4 and DSE 6.7

Changes Since DSE 5.1
• Features—DSE uses Scala 2.11.12 with Spark 2.2, Spark Jobserver 0.8.0
• Performance—DSE-specific optimizations in Spark SQL query planner
• New APIs—Structured Streaming
• Availability—AlwaysOn SQL Server (AOSS)
• Manageability—Improved Spark Resource Manager Workpools

What’s new in DSE 6.7?
Features and Functionality Updates

• Spark Integration with DataStax Studio—Discussed in detail later


• DSE-Aware Query Planner
• Improved Security
• Improved DSEFS functionality

DSE Aware Query Planner
Improving Performance

• Use DSE Search for much faster filtering on non-primary key columns or counting
• Direct joins or DSE Search filtering are not always faster than a full table scan; the optimizer will guess which method to use

DSE Aware Query Planner—Details
Manually Enable or Disable Optimizations
spark.sql.dse.search.enable_optimization
on, off, or auto
spark.sql.dse.search.auto_ratio
defaults to 0.03
// DSE Search is used when (the number of rows * this parameter) > estimated number of rows returned from DSE Search
direct_join_setting
on, off, or auto
direct_join_size_ratio
defaults to 0.9
// Direct join is used when (the number of rows * this param) > the other side of the join
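The two threshold rules in the comments above can be written as small decision functions. This is a sketch of the stated inequalities only, not the planner's code: the function and argument names are hypothetical, and the row counts in the usage lines are made up.

```python
def use_search_optimization(table_rows, estimated_search_rows,
                            setting="auto", auto_ratio=0.03):
    """Decide whether DSE Search filtering is used, per the rule on the slide."""
    if setting == "on":
        return True
    if setting == "off":
        return False
    # auto: use Search when it is expected to return a small enough fraction
    return table_rows * auto_ratio > estimated_search_rows

def use_direct_join(table_rows, other_side_rows,
                    setting="auto", size_ratio=0.9):
    """Decide whether a direct join is used, per the rule on the slide."""
    if setting == "on":
        return True
    if setting == "off":
        return False
    return table_rows * size_ratio > other_side_rows

# Hypothetical sizes: a 1,000,000-row table.
print(use_search_optimization(1_000_000, 10_000))  # 30,000 > 10,000
print(use_search_optimization(1_000_000, 50_000))  # 30,000 > 50,000 is false
print(use_direct_join(1_000, 800))                 # 900 > 800
```

With the defaults, Search is chosen in auto mode only when it is estimated to return less than 3% of the table, while a direct join is chosen whenever the other side of the join is smaller than 90% of the table.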

DSEFS Functionality
DSEFS Improvement in DSE 6.7

• Security—Internode and client-server connection security with TLS/SSL


• Ease of Use—Path expansion with wildcards
• Performance—Improved handling of directories with 100k+ entries
• Performance—Option to disable fsync
• Manageability—Error messages properly back-propagated to the client
• Manageability—Improved logging

DSE Analytics Security Features
Security Improvements in DSE 6.7

• Support for all authentication and authorization schemes in Spark SQL Server
• Authorization in Spark Master/Worker UIs
• Kerberos support in DSEFS was present but limited
• Improved support
• Transitional mode support in DSEFS
• Internode and client-server encryption with TLS in DSEFS
