Lambda Architecture with Spark, Cassandra, Akka and Kafka (Scala Days)
Who Is This Person?
[email protected]
I need fast access to historical data on the fly for predictive modeling with real time data from the stream.
Lambda Architecture
A data-processing architecture designed to handle massive quantities of
data by taking advantage of both batch and stream processing methods.
• Spark is one of the few data processing frameworks that allows you to
seamlessly integrate batch and stream processing
• Of petabytes of data
• In the same application
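As a hedged illustration of that "in the same application" claim (not from the deck: the object name, paths and function names below are invented), one transformation can be written once against RDDs and reused in both a batch job and a streaming job:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchAndStream {

  // One piece of logic, written once against RDDs.
  def wordCount(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+"))
         .map(w => (w.toLowerCase, 1))
         .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("batch-and-stream").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))
    val sc   = ssc.sparkContext

    // Batch: run the logic over the accumulated historical data.
    wordCount(sc.textFile("/data/archive/")).saveAsTextFile("/data/reports/historical")

    // Stream: apply the same function to every micro-batch.
    ssc.textFileStream("/data/incoming/")
       .transform(wordCount _)
       .print()

    ssc.start()
    ssc.awaitTermination()
  }
}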
Moving Data Between Systems Is Difficult, Risky, and Expensive
How Do We Approach This?
Strategies
• Scalable Infrastructure
• Partition For Scale
• Replicate For Resiliency
• Share Nothing
• Asynchronous Message Passing
• Parallelism
• Isolation
• Data Locality
• Location Transparency
My Nerdy Chart
Strategy Technologies
"During Hurricane Sandy, we lost an entire data center. Completely. Lost. It.
Our data in Cassandra never went offline."
• Streaming
• Machine Learning
• Graph
Apache Spark - Easy to Use API
Returns the top (k) highest temps for any location in the year
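A hedged sketch of such a top-k query with the core RDD API (the Reading case class, its fields, and the parameters are assumptions, not the speaker's code):

import org.apache.spark.rdd.RDD

// Hypothetical shape of a raw temperature reading.
case class Reading(wsid: String, year: Int, temperature: Double)

/** Returns the k highest temperatures for one location (station) in one year. */
def topKTemps(readings: RDD[Reading], wsid: String, year: Int, k: Int): Array[Double] =
  readings
    .filter(r => r.wsid == wsid && r.year == year)
    .map(_.temperature)
    .top(k)   // built-in RDD action: no manual sort required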
Use the Spark Shell to quickly try out code samples.
Available in the Spark Shell and PySpark.
Collection To RDD
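A minimal sketch of turning a collection into an RDD, assuming the shell's predefined `sc` (the values are arbitrary):

// A local Scala collection becomes a distributed RDD with parallelize.
val temps    = Seq(72.0, 68.5, 90.1, 55.3)
val tempsRdd = sc.parallelize(temps)      // RDD[Double]

tempsRdd.max()                            // action: 90.1
tempsRdd.filter(_ > 60.0).count()         // transformation + action: 3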
Not Just MapReduce
sc.textFile(words)
  .flatMap(_.split("\\s+"))
RDDs Can be Generated from a
Variety of Sources
• Scala Collections
• Text files
RDD Operations
• Transformation
• Action
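A quick illustration of the lazy/eager split, again assuming the shell's `sc`:

val words = sc.parallelize(Seq("spark", "cassandra", "kafka", "akka"))

// Transformations are lazy: they only describe the computation.
val lengths   = words.map(_.length)       // nothing executes yet
val longWords = lengths.filter(_ > 5)     // still nothing executes

// Actions trigger a job and return results to the driver.
longWords.count()                         // 1  (only "cassandra" is longer than 5)
words.collect()                           // Array(spark, cassandra, kafka, akka)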
Setting up C* and Spark
Apache Cassandra
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
Your Data Is Like Candy
Delicious: you want it now
Streaming Analytics: analytics as data arrives. The data won't be stale and neither will our analytics.
Batch Analytics: analysis after data has accumulated, which decreases the weight of the data by the time it is processed.
DStream
// `ssc` is the StreamingContext, created with the batch streaming interval.
ssc.textFileStream("s3n://raw_data_bucket/")
  .flatMap(_.split("\\s+"))
  .map(_.toLowerCase)
  .countByValue()
  .saveToCassandra(keyspace, table)

ssc.checkpoint(checkpointDir)
ssc.start()            // starts the streaming application, piping raw incoming data to a sink
ssc.awaitTermination()
ReceiverInputDStreams
DStreams - the stream of raw data received from streaming sources:
• Basic Source - in the StreamingContext API
• Advanced Source - in external modules and separate Spark artifacts
Receivers
• Reliable Receivers - for data sources supporting acks (like Kafka)
• Unreliable Receivers - for data sources not supporting acks
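For contrast, a hedged sketch of one basic and one advanced source, assuming an existing `ssc` (the hosts, topic and group id are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Basic source: available directly on the StreamingContext API (unreliable, no acks).
val socketLines = ssc.socketTextStream("localhost", 9999)

// Advanced source: lives in the separate spark-streaming-kafka artifact
// and uses a reliable, ack-aware receiver.
val kafkaParams = Map("zookeeper.connect" -> "localhost:2181", "group.id" -> "demo")
val topics      = Map("raw_weather" -> 1)   // topic -> number of receiver threads
val kafkaLines  = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)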
Spark Streaming External Source/Sink
Streaming Window Operations
kvStream
  .map { case (k, v) => (k, v.value) }
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b,
    Seconds(30), Seconds(10))   // 30 second window, sliding every 10 seconds
  .saveToCassandra(keyspace, table)
(Diagram: a Cassandra cluster spanning two data centers, US-East and Europe, each with ReplicationFactor=3.)
• Bursty traffic
• Volume of data from sensors requires a very large trigger and data acquisition system
• 30,000 applications on 2,000 nodes
Genetics / Biological Computations
IoT
CQL - Easy

CREATE TABLE users (
  username varchar,
  firstname varchar,
  lastname varchar,
  email list<varchar>,
  password varchar,
  created_date timestamp,
  PRIMARY KEY (username)
);

• Familiar syntax
• Many Tools & Drivers
• Many Languages
• Friendly to programmers
• Paxos for locking
Cassandra will automatically sort by most recent for both write and read
A record of every event, in the order in which it happened, per URL:
streamingContext.union(multipleStreams)
  .map { httpRequest => TimelineRequestEvent(httpRequest) }
  .saveToCassandra("requests_ks", "timeline")
Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
•NOSQL JOINS! (see the sketch after this list)
•Write & Read data between Spark and Cassandra
•Compatible with Spark 1.3
•Handles Data Locality for Speed
•Implicit type conversions
•Server-Side Filtering - SELECT, WHERE, etc.
•Natural Timeseries Integration
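On the "NOSQL JOINS" bullet: the connector exposes joinWithCassandraTable. A hedged sketch, assuming an existing `sc` (the keyspace, table and key names are made up, not the deck's schema):

import com.datastax.spark.connector._

// Keys we want to look up, held in a plain RDD.
case class StationKey(wsid: String)
val stations = sc.parallelize(Seq(StationKey("724940:23234"), StationKey("725030:14732")))

// Joins each key against matching rows in Cassandra without a full table scan;
// each lookup is pushed down to the partition that owns the key.
val joined = stations
  .joinWithCassandraTable("weather_ks", "raw_weather_data")
  .on(SomeColumns("wsid"))

joined.collect().foreach { case (key, row) => println(s"${key.wsid} -> $row") }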
Spark Cassandra Connector
(Diagram: each Spark Executor runs the User Application, the Spark-Cassandra Connector and the C* Driver, which talks to the Cassandra cluster nodes.)
Writing and Reading
SparkContext:
import com.datastax.spark.connector._

StreamingContext:
import com.datastax.spark.connector.streaming._
Write from Spark to Cassandra
// SparkContext                              Keyspace    Table
sc.parallelize(Seq(0, 1, 2)).saveToCassandra("keyspace", "raw_data")

predictionsRdd.join(music).saveToCassandra("music", "predictions")
Read From C* to Spark
CassandraRDD[CassandraRow]
Every Spark task uses a CQL-like query to fetch data for the given token range:
(Diagram: each Spark Executor queries only its assigned token ranges, e.g. 350 - 749 and 750 - 99.)
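A hedged sketch of the read side, assuming an existing `sc` (the keyspace, table and column names are illustrative): reads come back as a CassandraRDD[CassandraRow], and select/where are pushed down to Cassandra.

import com.datastax.spark.connector._

// Untyped rows: CassandraRDD[CassandraRow]
val rows = sc.cassandraTable("weather_ks", "raw_weather_data")
  .select("wsid", "temperature")
  .where("wsid = ?", "725030:14732")

rows.map(_.getDouble("temperature")).mean()

// Or map rows directly to a case class.
case class RawWeather(wsid: String, temperature: Double)
val typed = sc.cassandraTable[RawWeather]("weather_ks", "raw_weather_data")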
Spark SQL with Cassandra
import org.apache.spark.sql.cassandra.CassandraSQLContext

// write
sql.jsonRDD(json)
  .map(CommitStats(_))
  .flatMap(compute)
  .saveToCassandra("stats", "monthly_commits")

// read
val rdd = sc.cassandraTable[MonthlyCommits]("stats", "monthly_commits")
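The imported CassandraSQLContext can also query the same tables with SQL; a minimal sketch, assuming the `monthly_commits` table has a `year` column:

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)

// Keyspace-qualified table names resolve directly against Cassandra.
val commits = cc.sql("SELECT * FROM stats.monthly_commits WHERE year = 2015")
commits.collect().foreach(println)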
Spark Streaming, Kafka, C* and JSON
KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
  .map { case (_, json) => JsonParser.parse(json).extract[MonthlyCommits] }
  .saveToCassandra("github_stats", "commits_aggr")

sparkConf.set("spark.cassandra.connection.host", "10.20.3.45")
val streamingContext = new StreamingContext(conf, Seconds(30))
Spark Streaming ML, Kafka & C*
val ssc = new StreamingContext(new SparkConf()…, Seconds(5))

trainingStream.saveToCassandra("ml_keyspace", "raw_training_data")

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))

model.trainOn(trainingStream)

// Making predictions on testData
model
  .predictOnValues(testData.map(lp => (lp.label, lp.features)))
  .saveToCassandra("ml_keyspace", "predictions")
KillrWeather
• Global sensors & satellites collect data
• Cassandra stores in sequence
• Application reads in sequence
Data model should look like your queries
Queries I Need
• Get data by ID
• Get data for a single date and time
• Get data for a window of time
• Compute, store and retrieve daily, monthly, annual aggregations
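A hedged sketch of how those queries map onto pushed-down connector reads, assuming an existing `sc` (the keyspace, table and column names are illustrative, not the actual KillrWeather schema):

import com.datastax.spark.connector._

val wsid = "725030:14732"   // hypothetical station id

// Get data by ID (partition key only).
val byId = sc.cassandraTable("weather_ks", "raw_data").where("wsid = ?", wsid)

// Get data for a single date and time (partition key + clustering columns).
val oneHour = sc.cassandraTable("weather_ks", "raw_data")
  .where("wsid = ? AND year = ? AND month = ? AND day = ? AND hour = ?",
         wsid, 2015, 6, 9, 12)

// Get data for a window of time (range over the clustering columns).
val oneMonth = sc.cassandraTable("weather_ks", "raw_data")
  .where("wsid = ? AND year = ? AND month = ?", wsid, 2015, 6)

// Daily / monthly / annual aggregations are computed in Spark, written to their
// own pre-aggregated tables, and then read back by key just like the above.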
cluster.joinSeedNodes(Vector(..))
context.actorOf(BalancingPool(PoolSize).props(Props(
new KafkaPublisherActor(KafkaHosts, KafkaBatchSendSize))))
Cluster(context.system) registerOnMemberUp {
context.actorOf(BalancingPool(PoolSize).props(Props(
new HttpReceiverActor(KafkaHosts, KafkaBatchSendSize))))
}
/** Now the [[StreamingContext]] can be started. */
context.parent ! OutputStreamInitialized
def receive : Actor.Receive = {…}
}
Gets the partition key for Data Locality: the Spark C* Connector feeds this to Spark.
A Cassandra counter column in our schema means no expensive `reduceByKey` is needed. Simply let C* do it: not expensive, and fast.
/** For a given weather station, calculates annual cumulative precip - or year to date. */
class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {
def receive : Actor.Receive = {
case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender)
case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
}
/** Computes annual aggregation. Precipitation values are 1-hour deltas from the previous. */
def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
ssc.cassandraTable[Double](keyspace, dailytable)
.select("precipitation")
.where("wsid = ? AND year = ?", wsid, year)
.collectAsync()
.map(AnnualPrecipitation(_, wsid, year)) pipeTo requester
/** Returns the `k` highest precipitation values for the given station in the `year`. */
def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
ssc.sparkContext.parallelize(aggregate).top(k).toSeq)
ssc.cassandraTable[Double](keyspace, dailytable)
.select("precipitation")
.where("wsid = ? AND year = ?", wsid, year)
.collectAsync().map(toTopK) pipeTo requester
}
}
Efficient Batch Analytics
class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
extends AggregationActor {
import akka.pattern.pipe
def receive: Actor.Receive = {
case e: GetMonthlyHiLowTemperature => highLow(e, sender)
}
def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
.where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
.collectAsync()
.map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
}
C* data is automatically sorted by most recent - due to our data model.
Additional Spark or collection sort not needed.
@helenaedelson
github.com/helena
slideshare.net/helenaedelson
Learn More Online and at Cassandra Summit
https://academy.datastax.com/