DS320 DataStax Enterprise

Analytics with Apache Spark™

Introduction

© 2019 DataStax. 2
Use only with permission. academy.datastax.com
Analysis Steps
• Statistical Analysis
• Classification
• Clustering
• Regression
• Similarity Matching
• Collaborative Filtering
• Profiling
• Dimensionality Reduction
• Feature Extraction

DSE Analytics Overview
• DSE Analytics
• Based on Apache Spark™
• Benefits of DSE Analytics
• Always on
• Workload isolation
• Faster operational analytics than open source
• Create personalized experiences
• Deliver real-time insights
• Process data to gain a 360-degree view of customer

Why Analytics?
• Data tells the story behind the operations
• Data has meaning
• Data is important to the business
• The business can then make decisions based on the information
• Ask yourself the following question:
• How does the meaning of your company's data help your company?

Apache Cassandra™ vs. DSE Analytics
• Apache Cassandra™ queries on the partition key and clustering columns
• Models are not built around relations nor around objects
• Models are built around your queries
• Queries focus on OLTP performance
• DSE Analytics (Spark) frees users to query on any field using familiar RDBMS patterns
• This includes aggregates, joins, group by, etc.
• Cassandra does not accept SQL, and CQL alone cannot express all of these queries
• DSE Analytics provides an interface for querying the data

What's an Analytical Query?
• Examples of Analytical Queries:
• Number of videos viewed by each user
• Number of videos viewed by each user in each genre
• Average rating per video
• Most popular videos
• Trending videos
• Video recommendations
• In Summary:
• Analytical queries are queries that can affect business decisions and customers' purchasing decisions
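To make two of the example queries concrete, here is a minimal sketch in plain Scala collections (no Spark required). The `View` record, its fields, and the sample rows are hypothetical and invented for illustration; they are not part of the course data set.

```scala
object AnalyticalQuerySketch {
  // Hypothetical record: one row per video view, with the rating the user gave.
  final case class View(userId: String, videoId: String, genre: String, rating: Double)

  // Made-up sample data, for illustration only.
  val views: Seq[View] = Seq(
    View("u1", "v1", "comedy", 4.0),
    View("u1", "v2", "drama", 2.0),
    View("u1", "v3", "comedy", 1.0),
    View("u2", "v1", "comedy", 5.0),
    View("u3", "v1", "comedy", 3.0)
  )

  // "Number of videos viewed by each user in each genre": group on a two-part key.
  def viewsPerUserAndGenre(rows: Seq[View]): Map[(String, String), Int] =
    rows.groupBy(v => (v.userId, v.genre)).map { case (key, vs) => key -> vs.size }

  // "Average rating per video": group by video, then average the ratings.
  def avgRatingPerVideo(rows: Seq[View]): Map[String, Double] =
    rows.groupBy(_.videoId).map { case (video, vs) => video -> vs.map(_.rating).sum / vs.size }

  def main(args: Array[String]): Unit = {
    println(viewsPerUserAndGenre(views))
    println(avgRatingPerVideo(views))
  }
}
```

Both functions touch every row, which is what separates an analytical query from a lookup by partition key.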

Using the Driver?
• Why not use one of the drivers and perform these queries on the application side?
• Involves significant "plumbing code"
• Requires CQL queries against the appropriate tables
• Cons of pulling all data into a single application:
• Network cost
• Memory limitations
• Limited concurrency

Using DSE Analytics
• SQL Syntax
• Developers are familiar with SQL
• Queries the tables directly using succinct Scala syntax
• More declarative
• Other APIs are available
• Java, Python, R
• DSE Analytics runs only in individual datacenters and uses multiple machines to compute the results
• Data locality
• Failure tolerance
• Checkpointing

Word Counting in a Distributed Environment
Stages: Input → Splitting → Mapping → Shuffling → Reducing → Final Result

Input (K1, V1):
Alpha Bravo Charlie
Delta Delta Charlie
Alpha Delta Bravo

Splitting (one split per line):
Split 1: Alpha Bravo Charlie
Split 2: Delta Delta Charlie
Split 3: Alpha Delta Bravo

Mapping, producing List(K2, V2):
Split 1: (Alpha, 1) (Bravo, 1) (Charlie, 1)
Split 2: (Delta, 1) (Delta, 1) (Charlie, 1)
Split 3: (Alpha, 1) (Delta, 1) (Bravo, 1)

Shuffling, producing K2, List(V2):
Bravo, (1,1)
Delta, (1,1,1)
Alpha, (1,1)
Charlie, (1,1)

Reducing, producing List(K3, V3):
Bravo, 2
Delta, 3
Alpha, 2
Charlie, 2

Final Result:
Bravo, 2
Delta, 3
Alpha, 2
Charlie, 2
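The stages in the diagram can be imitated on a single machine with plain Scala collections. The sketch below only illustrates the data shapes at each stage; it is not how Hadoop or Spark distribute the work, and the object and function names are made up for this example.

```scala
object WordCountStages {
  // Input lines taken from the diagram.
  val input: Seq[String] = Seq(
    "Alpha Bravo Charlie",
    "Delta Delta Charlie",
    "Alpha Delta Bravo"
  )

  // Mapping: every word becomes a (word, 1) pair, i.e. List(K2, V2).
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split(" ")).map(word => (word, 1))

  // Shuffling: bring all values for the same key together, i.e. K2 -> List(V2).
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2) }

  // Reducing: sum the values of each key, i.e. List(K3, V3).
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => word -> ones.sum }

  val counts: Map[String, Int] = reducePhase(shufflePhase(mapPhase(input)))

  def main(args: Array[String]): Unit =
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word, $n") }
}
```

In a cluster, the map phase runs in parallel on each split and the shuffle moves pairs across the network so that each key ends up on a single reducer.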

Why Was Map-Reduce so Revolutionary?
• Most analytical queries can be broken down into mapping and reducing
• Map-reduce is a generic process
• It spreads data among several machines to achieve parallelism

Verbose Code in a Hadoop Implementation
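import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Write a Mapper: emit (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Write a Reducer: sum the counts collected for each word
  // (the class header and the result field were lost from the slide; they follow the standard Hadoop WordCount example)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}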

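// The same word count in Spark (Scala); words is assumed to be an RDD[String] of individual words
val counts = words
  .map(word => (word, 1))
  .reduceByKey { case (x, y) => x + y }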

Exercise 01.01: Setting Up the Lab Environment
• In this exercise, you will:
• Set up the virtual machine for this course
• Familiarize yourself with cqlsh

Apache Spark™ History (1 of 2)
• Started at UC Berkeley in 2009
• Open sourced in 2010 under BSD license
• Written in Scala, a common language at the University
• Spark was written by students working on a resource manager, and was designed to "Spark" interest in Mesos
• In 2013 Spark was donated to the Apache Software Foundation
• 1000+ contributors in 2019
• Several supported formats and data adapters
• Hive, Avro, Parquet, ORC, JSON, and JDBC
• Able to run on different resource managers and work in a batch or streaming fashion

Apache Spark™ History (2 of 2)
• Developed to address limitations of MapReduce
• Uses memory as much as possible
• Attempts to escape the rigidity of the MapReduce paradigm
• Latest version is 2.4.4 (September 2019)
• Latest DSE Analytics is based on Spark 2.2.3
• Provides major speed enhancements as of the following versions:
• v1.6 DataFrames
• v2.0 DataSets
• Easy to code
• Provides some real-time analytics

Apache Spark™ is Young
• Quick development
• Rapid version changes
• API is updated rapidly
• Still being innovated
• DSE Analytics remains a few releases behind the latest version
• DataStax emphasizes stability over "bleeding edge" releases

DSE and Apache Spark™
• DSE allows users to submit jobs to any node
• Knowing which node is master is not required
• Spark in DSE Analytics attempts to schedule tasks on the same node that owns the data
• Spark master and worker run in the same JVM as DSE
• Two (2) components are at work here:
• The connector used to access data in Apache Cassandra™
• DSE Analytics, which is the customized version of Apache Spark™

Improvements
• Within the connector there are numerous Apache Cassandra™ and DSE-specific enhancements that increase performance
• Those familiar with Spark already will note:
• DSE uses the standalone resource manager, but it has been enhanced with failover and persistent state stored in Apache Cassandra™

[Figure: operational workload. Clients send transactions through the C* client driver to the C* nodes.]

[Figure: Spark application startup. Components: Client (Spark Driver with SparkContext), Master, Workers, Executors. Steps: 1. Request for computational resources; 2. Allocate computational resources; 3. Start Executors; 4. Schedule and perform computation.]

DSE Integration
[Figure: DSE nodes, each running C* alongside a Spark Worker, its Executors, and the Spark-Cassandra Connector; one node also hosts the Master.]

DAG
• Directed Acyclic Graph (DAG)
• A DAG is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations to be applied to the RDDs
• In a Spark DAG, every edge points from earlier to later in the sequence
• The Driver turns these operations into a DAG, which is then submitted to Spark for execution
• Stages and Tasks
• Stages optimize several transformations together
• Shuffle step is the barrier
• TaskScheduler assigns tasks
• Executors execute tasks
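The idea that a shuffle is the barrier between stages can be sketched with a toy model. This is a deliberate simplification and an assumption of this example: it treats the job as a linear chain of operations, whereas a real DAG can branch, and the `Op` type and `toStages` function are hypothetical names, not Spark classes.

```scala
object DagStagesSketch {
  // Hypothetical model: an operation either needs a shuffle (wide) or does not (narrow).
  final case class Op(name: String, needsShuffle: Boolean)

  // Walk the chain in order; a shuffle-requiring operation closes the current stage
  // and opens the next one, while narrow operations are pipelined into the current stage.
  def toStages(ops: Seq[Op]): Seq[Seq[String]] =
    ops.foldLeft(Vector(Vector.empty[String])) { (stages, op) =>
      if (op.needsShuffle && stages.last.nonEmpty) stages :+ Vector(op.name)
      else stages.init :+ (stages.last :+ op.name)
    }

  // A word-count style pipeline; the operation names mirror common RDD methods.
  val pipeline: Seq[Op] = Seq(
    Op("textFile", needsShuffle = false),
    Op("flatMap", needsShuffle = false),
    Op("map", needsShuffle = false),
    Op("reduceByKey", needsShuffle = true),
    Op("filter", needsShuffle = false)
  )

  val stages: Seq[Seq[String]] = toStages(pipeline)

  def main(args: Array[String]): Unit =
    stages.zipWithIndex.foreach { case (ops, i) => println(s"Stage $i: ${ops.mkString(" -> ")}") }
}
```

Each stage is then cut into tasks, one per partition, which is what the TaskScheduler hands to the executors.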

TaskScheduler
• The TaskScheduler performs the following functions:
• Responsible for sending the tasks to the cluster
• Handles the running tasks
• Retries the tasks if there are failures
• Mitigates any stragglers
• The Worker creates an executor JVM, but not for any specific task
• The Driver requests the executors
• The Master tells the workers to boot them
• Then, the executors report directly to the Driver
• All coordination of tasks happens between the Driver and the Executors; no workers are involved
• View the DAG and monitor its operation through the Spark UI

• DSEFS works with Spark in several ways
• As a shared file system, DSEFS can be used for the following:
• Checkpointing
• Loading .jar files containing jobs, so the executors do not load them on the local file system or have them sent by the Driver
• Storing intermediate files (Parquet is a good choice) with fault tolerance
• Storing end results of some types of jobs, e.g. reports
• Archiving data in DSEFS is a normal practice
• DSEFS is an HDFS-compatible file system

DSE Analytics—Search Integration
• Utilize search indexes on tables that have them
• Node must be SearchAnalytics enabled
• Use with the RDD API, DataFrame API, or Spark SQL API
• Spark SQL also automatically recognizes when Search indexing is enabled
• Example using CQL-based Search predicates without the need for solr_query:

SELECT id, artist_name FROM music.solr WHERE artist_name LIKE 'Miles%' LIMIT 10

DSE Analytics—Use Cases
• Batch
• Streaming
• Data integrity
• Verify that all denormalized copies of the data are in sync
• ETL
• DSE Analytics simplifies this immensely
• Hint: Spark SQL
• Machine learning
• ODBC/JDBC connectivity

Spark SQL Thrift Server / AlwaysOn SQL
• Spark SQL Thrift Server is now branded as AlwaysOn SQL Server in DSE
• The new Simba driver also has the capability to connect to any node in the cluster and has fault tolerance
• Don't confuse the AlwaysOn SQL Server with the deprecated Apache Cassandra™ Thrift protocol
• The Thrift server still exists in Spark and handles ODBC/JDBC
• https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html
• DataStax took the Thrift server to the next level, rebranding it as AlwaysOn SQL Server (AOSS), adding fault tolerance and caching, and working with Simba to confirm that drivers function with the new features
• AlwaysOn SQL Server handles JDBC calls and can do both reads and writes
• AlwaysOn SQL Server can read data from relational databases and write it to Apache Cassandra™
• It can also read data from any Spark-compatible data source and write to Apache Cassandra™

Apache Spark™ Streaming
• Apache Spark™ Streaming is another "long running" Spark application
• Processes data over time windows
• Apache Spark™ Streaming is commonly associated with the Spark application
• DStream takes a parameter that determines how large or small the micro-batches are
• Example: Every two seconds it takes all the received data and "does something"
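The micro-batch idea can be illustrated without Spark by bucketing timestamped events into fixed intervals. This is only a sketch of the concept: the `Event` type and `microBatches` function are hypothetical, and unlike a real streaming job this version sees all events up front and emits no batch for an interval that received nothing.

```scala
object MicroBatchSketch {
  // Hypothetical event: arrival time in milliseconds plus a payload.
  final case class Event(timestampMs: Long, payload: String)

  // Group events into consecutive batches of batchIntervalMs milliseconds.
  // Each result pairs the batch start time with the payloads received in that interval.
  def microBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batchIndex, es) =>
        (batchIndex * batchIntervalMs, es.sortBy(_.timestampMs).map(_.payload))
      }

  // Made-up sample events.
  val events: Seq[Event] = Seq(
    Event(100L, "a"),
    Event(1900L, "b"),
    Event(2100L, "c"),
    Event(4500L, "d")
  )

  // Two-second batches, as in the slide's example.
  val batches: Seq[(Long, Seq[String])] = microBatches(events, 2000L)

  def main(args: Array[String]): Unit =
    batches.foreach { case (start, payloads) => println(s"batch starting at $start ms: $payloads") }
}
```

A smaller interval lowers latency but produces more, smaller batches to schedule; a larger one does the opposite.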

Configuration
/etc/default/dse
# Enable the DSE Graph service on this node
GRAPH_ENABLED=0

# Start the node in DSE Search mode
SOLR_ENABLED=0

# Start the node in Spark mode
SPARK_ENABLED=1

Configuration—dse.yaml
• Found in either:
• /etc/dse/dse.yaml
• <install_dir>/resources/dse/conf/dse.yaml
• Configure additional Spark settings here:
• Spark cluster and application statistics being collected
• Initial Spark worker resources (as a percentage after C*)
• Spark security and encryption
• Multiple Hadoop options
• Workpools and Always On SQL Server (AOSS)
• Spark readiness check
• DSEFS (and its configuration)
• Spark Auditing (more as C* but spark-sql audit shows up)
• Hive settings

Configuration—dse-spark-env.sh
• Found in either:
• /etc/dse/spark/dse-spark-env.sh
• <install_dir>/resources/spark/conf/dse-spark-env.sh
• Mainly for defaults regarding running Spark with DSE
• Most environment changes are done in spark-env.sh

Configuration—spark-env.sh
• Found in either:
• /etc/dse/spark/spark-env.sh
• <install_dir>/resources/spark/conf/spark-env.sh
• Permits the setting of default CORES and MEMORY across the following:
• Workers
• Executors
• Master
• Driver
• Ability to fine tune memory and processors rather than the generic % found in the dse.yaml file

Configuration – spark-defaults.conf
• Found in the usual Spark configuration locations
• Allows you to pass in default spark properties
• If using encryption specify settings here
• Note this only affects applications which are running on the node with the file, and only affects applications run through "dse spark-submit"
• Ability to use a different file to set defaults for various apps
• dse spark-submit --properties-file new-properties-file
• Only one value applies per property: if a property is set in spark-defaults.conf and also in the file you pass in, the value from spark-defaults.conf is ignored, but the rest of spark-defaults.conf still applies
• In a properties file, each property is separated from its value by whitespace or =
• Default is just that; the default is used for the majority of applications
• A secondary properties file can be set per application
• Any property can also be individually configured within the application
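The layering described above (defaults, then a per-application properties file, then settings made inside the application) can be modelled as a merge of maps in which later layers win. This is a simplified model of the slide's description, not DSE's or Spark's actual loader; `parse` and the sample values are hypothetical, while the property names used are standard Spark properties.

```scala
object SparkPropertiesSketch {
  // Parse lines of the form "key value" or "key=value".
  // Blank lines and lines starting with # are skipped.
  def parse(text: String): Map[String, String] =
    text.split("\n").toSeq
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map(_.split("[=\\s]+", 2))
      .collect { case Array(key, value) => key -> value.trim }
      .toMap

  // Stand-in for spark-defaults.conf (made-up values).
  val defaults: Map[String, String] =
    parse("# defaults\nspark.scheduler.mode FIFO\nspark.executor.memory=2g")

  // Stand-in for a file passed with --properties-file (made-up values).
  val perApplicationFile: Map[String, String] =
    parse("spark.scheduler.mode = FAIR")

  // Stand-in for properties set in the application code itself.
  val inApplication: Map[String, String] =
    Map("spark.app.name" -> "demo")

  // Later layers override earlier ones, key by key.
  val effective: Map[String, String] = defaults ++ perApplicationFile ++ inApplication

  def main(args: Array[String]): Unit =
    effective.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }
}
```

Here `spark.scheduler.mode` comes from the per-application file while `spark.executor.memory` still falls through from the defaults.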

Exercise 01.02: Working with Configuration Files

In this exercise, you will:

• Open and explore DSE Analytics configuration files

CREATE TABLE videos (
  video_id TIMEUUID,
  avg_rating FLOAT,
  description TEXT,
  genres SET<TEXT>,
  mpaa_rating TEXT,
  release_date TIMESTAMP,
  release_year INT,
  title TEXT,
  user_id UUID,
  PRIMARY KEY (video_id)
);
• Problem: Can only query this table on video_id

One (Inferior) Solution
• Write an application that connects via the Apache Cassandra™ Driver
• Processes all the data in the application
• Requires moving all the data to the application
• Places pressure on one (1) machine instead of distributing the workload throughout the cluster

Spark SQL
• Yes, you read that correctly; SQL on top of the Cassandra tables
• OLAP

ubuntu@ds320-node1:~$ dse spark-sql
The log file is at /home/ubuntu/.spark-sql-shell.log
spark-sql>

• DataStax Enterprise makes this distributed architecture simple

spark-sql> SELECT release_year, count(*) as num_videos
         > FROM killrvideo.videos
         > GROUP BY release_year
         > ORDER BY num_videos DESC
         > LIMIT 3;
2015 446
2009 295
2011 288
Time taken: 6.093 seconds, Fetched 3 row(s)
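For readers more at home in Scala than SQL, the query above corresponds to a group, count, sort, and take. The sketch below runs on an in-memory list rather than on Cassandra; the `Video` case class is a hypothetical cut-down stand-in for a row of killrvideo.videos, and the sample rows are invented.

```scala
object TopReleaseYearsSketch {
  // Hypothetical stand-in for a row of killrvideo.videos (only the fields needed here).
  final case class Video(title: String, releaseYear: Int)

  // GROUP BY release_year, count(*), ORDER BY count DESC, LIMIT n.
  // Ties are broken by year so the result is deterministic; the SQL leaves tie order undefined.
  def topReleaseYears(videos: Seq[Video], n: Int): Seq[(Int, Int)] =
    videos
      .groupBy(_.releaseYear)
      .map { case (year, vs) => (year, vs.size) }
      .toSeq
      .sortBy { case (year, count) => (-count, year) }
      .take(n)

  // Made-up sample rows.
  val sample: Seq[Video] = Seq(
    Video("a", 2015), Video("b", 2015), Video("c", 2015),
    Video("d", 2009), Video("e", 2009),
    Video("f", 2011)
  )

  def main(args: Array[String]): Unit =
    topReleaseYears(sample, 3).foreach { case (year, count) => println(s"$year $count") }
}
```

The difference in practice is where the work happens: this version needs every row in one JVM, whereas Spark SQL performs the grouping across the cluster.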

Hive Query Language
• Apache Spark no longer uses the Hive Query Language
• Spark now has its own SQL parser
• It is a superset of the Hive Query Language
• It is also ANSI SQL compatible
• Minor syntactic differences
• For example: TOP 3 vs. LIMIT 3

In this exercise, you will:

• Set up environment
• Run SQL Queries

REPL
• Read-Evaluate-Print Loop (REPL)
• Terminal for Apache Spark™ commands
• Uses Scala
• Nothing to fear here
• Scala runs on the JVM and compiles into Java class files
• More terse
• Functional
• Start the REPL by typing the following:
dse spark

DSE Spark
ubuntu@ds320-node1:~$ dse spark
The log file is at /home/ubuntu/.spark-shell.log
warning: there was one deprecation warning; re-run with -deprecation for details
New Spark Session
WARN 2019-05-09 21:55:52,448 org.apache.spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.
Extracting Spark Context
Extracting SqlContext
Spark context Web UI available at http://107.23.178.22:4040
Spark context available as 'sc' (master = dse://?, app id = app-20170509215551-0053).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.3
      /_/
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

SparkSession
• REPL automatically sets up a variable named spark
• Instance of a SparkSession
• SparkSession is the entry point for all things Spark

Spark SQL
• Run SQL commands via spark.sql()
• SQL statement does not require a semicolon because it is no longer in a Spark SQL shell

spark.sql("""
SELECT release_year, count(*) as num_videos
FROM killrvideo.videos
GROUP BY release_year
ORDER BY num_videos DESC
LIMIT 3""")

• REPL makes a variable for every line; you just don't see them when they are of type Unit:

scala> 1 + 1
res1: Int = 2
scala> (1 + 1).asInstanceOf[Unit]
scala> res2.getClass // See, still defined
res3: Class[Unit] = void // Note that here it is res3
// res2 remains silent

scala> :quit
ubuntu@ds320-node1:~$

Exercise 01.04: Using REPL and spark.sql()
• In this exercise, you will:
• Start REPL
• Run SQL queries using spark.sql

Master Web UI
• Open a browser
• Navigate to the following:
http://<your ip address>:7080
• Master UI

ubuntu@ds320-node1:~$ dse client-tool spark master-address
dse://54.193.32.55:9042?connection.local_dc=DC1;connection.host=;
ubuntu@ds320-node1:~$ dsetool ring
Address       DC   Rack   Workload       Graph  Status  State   Load       Owns  Token   Health [0,1]
172.31.17.26  DC1  rack1  Analytics(SM)  no     Up      Normal  74.45 MiB  ?     -77...  0.50
ubuntu@ds320-node1:~$ dsetool status

Spark Architecture 1000 Foot View
• Master
• Bookkeeper
• Worker
• Middle management
• Executor
• Does work
• Driver
• Your application

Master
• The Master tracks the total resources as reported by each worker
• It is the "Bookkeeper"
• One per data-center (not per cluster)
• Keeps track of where everything is running in the system
• Assigns executors to the Driver
• Commands the worker to launch the executors
• Identifies what resources are available
• CPU cores
• Memory
• Hosts its own admin UI
• Makes very few decisions
• Driver contacts master to get resources

Worker
• Each worker defines how many resources are available on the node it manages
• Job Functions:
• Middle management
• One instance per node
• Does not perform much "work"
• They do what they are told to do and keep track of their little corner of the world
• Forks executors
• Reports back to Master
• Not interesting; not important

Executor
• Job Functions:
• The workhorse
• Does what it is told to do
• Makes very few decisions
• Generally one executor per node
• Does not take directions from the worker or the master
• Directions come from the Driver
• Runs in its own JVM
• By default, one executor per application on each node, using the cores allocated to Spark on that machine
• Ships results back to Driver
• Generally, an executor has several "cores"
• For example: 4 nodes and 20 requested cores would result in 4 (5-core) executors
• If there is only a single executor per core, then it was configured that way
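The arithmetic behind the "4 nodes and 20 requested cores" example can be written out as a small function. The even-spread rule used here is an assumption made for illustration; actual placement depends on the resource manager's settings and on what each worker has free. `executorLayout` is a hypothetical helper, not a Spark API.

```scala
object ExecutorSizingSketch {
  // Assumed model: one executor per node, with the requested cores spread as evenly
  // as possible. Returns the number of cores given to the executor on each node.
  def executorLayout(nodes: Int, requestedCores: Int): Seq[Int] = {
    require(nodes > 0, "need at least one node")
    require(requestedCores >= 0, "requested cores cannot be negative")
    val base = requestedCores / nodes   // cores every executor receives
    val extra = requestedCores % nodes  // leftover cores, handed out one per node
    Seq.tabulate(nodes)(i => base + (if (i < extra) 1 else 0))
  }

  def main(args: Array[String]): Unit = {
    // The slide's example: 4 nodes, 20 requested cores.
    println(executorLayout(4, 20))
    // An uneven case: 3 nodes, 10 requested cores.
    println(executorLayout(3, 10))
  }
}
```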

Driver (1 of 2)
• Job Functions:
• Your application – in charge
• Delegates responsibilities to the team
• "You'll do this, and you'll do that, etc."
• "Come back to me when you are finished with those pieces."
• Decides what executors do and don't do
• Driver decides how to slice and dice up the work, then assigns it out to individual executors
• Allocates tasks
• Note that data is only "collected" if a user asks for it
• By default the driver will not pull any data back to itself

Driver (2 of 2)
• Determines data locality
• Keep data on nodes where it's originally located
• Unless that node has too much data already
• Makes all the decisions
• Creates and owns a SparkContext instance
• Admin must always make certain the Driver has enough physical resources

Data Flow
• Executors contact the Driver directly
• Executors pull code from the Driver directly
• Executors pull the application jar, but all data they work with is pushed from the Driver OR pulled from other executors
• The executors cannot pull data directly from the Driver; it must be sent as part of their Task metadata

Master, Master, Who is the Master?
• Spark has a master/slave architecture
• Problem is…which node is the Master?
• DSE handles automatic leader election
• Also handles re-electing a new leader
• No need to worry about who is the Master when using DSE
• DSE looks the Master up for you
• While Spark does use a master/worker architecture, that description fits the driver/executors relationship best
• The Master and Workers form the resource manager that Spark uses to allocate resources

Spark Submission Modes
• Client mode
• Driver is external to the cluster (maybe)
• Cluster mode
• A message describing how to start the driver JVM is sent to the master
• The Master uses this description to start the application on a Worker
• Because the message does not contain the application jar, the jar path must be accessible from the workers where the JVM will be started (DSEFS, network mount point, etc.)
• Supervisor mode
• Not separate from cluster mode: it is an additional flag for cluster mode (cluster mode must still be enabled) that has the Master restart the driver JVM when it exits with a non-zero code

Scheduling—Config
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

Exercise 01.05: Master Web UI
In this exercise, you will:
• Familiarize yourself with Master UI

Changes Since DSE 5.1
• Features—DSE uses Scala 2.11.12 with Spark 2.2, Spark Jobserver 0.8.0
• Performance—DSE-specific optimizations in Spark SQL query planner
• New APIs—Structured Streaming
• Availability—AlwaysOn SQL Server (AOSS)
• Manageability—Improved Spark Resource Manager Workpools

What’s new in DSE 6.7?
Features and Functionality Updates
• Spark Integration with DataStax Studio: discussed in detail later
• DSE-Aware Query Planner
• Improved Security
• Improved DSEFS functionality

• Use DSE-Search for much faster filtering on non-primary key columns or counting
• Direct joins or DSE Search filtering are not always faster than a full table scan; the optimizer will guess which method to use

DSE Aware Query Planner—Details
Manually Enable or Disable Optimizations
spark.sql.dse.search.enable_optimization
  on, off, or auto
spark.sql.dse.search.auto_ratio
  defaults to 0.03
  // DSE Search is used when (the number of rows * this parameter) > estimated number of rows returned from DSE Search
direct_join_setting
  on, off, or auto
direct_join_size_ratio
  defaults to 0.9
  // Direct join is used when (the number of rows * this parameter) > the other side of the join
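The two "auto" rules quoted above are simple threshold comparisons, written out below as Scala functions. This only restates the formulas from the slide: the function and parameter names are hypothetical, "the number of rows" is assumed to mean the row count of the Cassandra table, and the real planner works from estimates rather than exact counts.

```scala
object QueryPlannerRulesSketch {
  // Slide rule: DSE Search is used when (table rows * ratio) > estimated rows returned by Search.
  // The default ratio of 0.03 is the value quoted on the slide.
  def useDseSearch(tableRows: Long, estimatedSearchRows: Long, autoRatio: Double = 0.03): Boolean =
    tableRows * autoRatio > estimatedSearchRows

  // Slide rule: a direct join is used when (table rows * ratio) > size of the other side of the join.
  // The default ratio of 0.9 is the value quoted on the slide.
  def useDirectJoin(tableRows: Long, otherSideRows: Long, sizeRatio: Double = 0.9): Boolean =
    tableRows * sizeRatio > otherSideRows

  def main(args: Array[String]): Unit = {
    // A selective predicate on a large table favours DSE Search.
    println(useDseSearch(tableRows = 1000000L, estimatedSearchRows = 10000L))
    // Joining against nearly the whole table favours a full scan instead of a direct join.
    println(useDirectJoin(tableRows = 1000L, otherSideRows = 950L))
  }
}
```

In both cases the optimization pays off only when it touches a small fraction of the table; otherwise a full scan is expected to be cheaper.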

• Security—Internode and client-server connection security with TLS/SSL

• Ease of Use—Path expansion with wildcards
• Performance—Improved handling of directories with 100k+ entries
• Performance—Option to disable fsync
• Manageability—Error messages properly back-propagated to the client
• Manageability—Improved logging

DSE Analytics Security Features
Security Improvements in DSE 6.7

• Support for all authentication and authorization schemes in Spark SQL Server
• Authorization in Spark Master/Worker UIs
• Kerberos support in DSEFS was present but limited
• Improved support
• Transitional mode support in DSEFS
• Internode and client-server encryption with TLS in DSEFS
