Lambda Architecture with Spark, Cassandra, Akka and Kafka (Scala Days)
Who Is This Person?
[email protected]
I need fast access to historical data on the fly for predictive modeling with real time data from the stream.
Lambda Architecture
A data-processing architecture designed to handle massive quantities of
data by taking advantage of both batch and stream processing methods.
• Spark is one of the few data processing frameworks that allows you to
seamlessly integrate batch and stream processing
• Of petabytes of data
• In the same application
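As a hedged illustration of that "in the same application" claim (not from the deck: the object name, paths and function names below are invented), one transformation can be written once against RDDs and reused in both a batch job and a streaming job:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchAndStream {

  // One piece of logic, written once against RDDs.
  def wordCount(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+"))
         .map(w => (w.toLowerCase, 1))
         .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("batch-and-stream").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))
    val sc   = ssc.sparkContext

    // Batch: run the logic over the accumulated historical data.
    wordCount(sc.textFile("/data/archive/")).saveAsTextFile("/data/reports/historical")

    // Stream: apply the same function to every micro-batch.
    ssc.textFileStream("/data/incoming/")
       .transform(wordCount _)
       .print()

    ssc.start()
    ssc.awaitTermination()
  }
}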
Moving Data Between Systems Is Difficult, Risky, and Expensive
How Do We Approach This?
Strategies
• Scalable Infrastructure
• Partition For Scale
• Replicate For Resiliency
• Share Nothing
• Asynchronous Message Passing
• Parallelism
• Isolation
• Data Locality
• Location Transparency
My Nerdy Chart
Strategy Technologies
"During Hurricane Sandy, we lost an entire data center. Completely. Lost. It.
Our data in Cassandra never went offline."
• Streaming
• Machine Learning
• Graph
Apache Spark - Easy to Use API
Returns the top (k) highest temps for any location in the year
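A hedged sketch of such a top-k query with the core RDD API (the Reading case class, its fields, and the parameters are assumptions, not the speaker's code):

import org.apache.spark.rdd.RDD

// Hypothetical shape of a raw temperature reading.
case class Reading(wsid: String, year: Int, temperature: Double)

/** Returns the k highest temperatures for one location (station) in one year. */
def topKTemps(readings: RDD[Reading], wsid: String, year: Int, k: Int): Array[Double] =
  readings
    .filter(r => r.wsid == wsid && r.year == year)
    .map(_.temperature)
    .top(k)   // built-in RDD action: no manual sort required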
Use the Spark Shell to quickly try out code samples.
Available in the Spark Shell and PySpark.
Collection To RDD
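A minimal sketch of turning a collection into an RDD, assuming the shell's predefined `sc` (the values are arbitrary):

// A local Scala collection becomes a distributed RDD with parallelize.
val temps    = Seq(72.0, 68.5, 90.1, 55.3)
val tempsRdd = sc.parallelize(temps)      // RDD[Double]

tempsRdd.max()                            // action: 90.1
tempsRdd.filter(_ > 60.0).count()         // transformation + action: 3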
Not Just MapReduce
sc.textFile(words)
  .flatMap(_.split("\\s+"))
RDDs Can be Generated from a
Variety of Sources
• Scala Collections
• Text files
RDD Operations
• Transformation
• Action
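A quick illustration of the lazy/eager split, again assuming the shell's `sc`:

val words = sc.parallelize(Seq("spark", "cassandra", "kafka", "akka"))

// Transformations are lazy: they only describe the computation.
val lengths   = words.map(_.length)       // nothing executes yet
val longWords = lengths.filter(_ > 5)     // still nothing executes

// Actions trigger a job and return results to the driver.
longWords.count()                         // 1  (only "cassandra" is longer than 5)
words.collect()                           // Array(spark, cassandra, kafka, akka)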
Setting up C* and Spark
Apache Cassandra
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
Your Data Is Like Candy
Delicious: you want it now
Streaming Analytics: analytics as data arrives. The data won't be stale and neither will our analytics.
Batch Analytics: analysis after data has accumulated, which decreases the weight of the data by the time it is processed.
DStream
// `ssc` is the StreamingContext, created with the batch streaming interval.
ssc.textFileStream("s3n://raw_data_bucket/")
  .flatMap(_.split("\\s+"))
  .map(_.toLowerCase)
  .countByValue()
  .saveToCassandra(keyspace, table)

ssc.checkpoint(checkpointDir)
ssc.start()            // starts the streaming application, piping raw incoming data to a sink
ssc.awaitTermination()
ReceiverInputDStreams
DStreams - the stream of raw data received from streaming sources:
• Basic Source - in the StreamingContext API
• Advanced Source - in external modules and separate Spark artifacts
Receivers
• Reliable Receivers - for data sources supporting acks (like Kafka)
• Unreliable Receivers - for data sources not supporting acks
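For contrast, a hedged sketch of one basic and one advanced source, assuming an existing `ssc` (the hosts, topic and group id are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Basic source: available directly on the StreamingContext API (unreliable, no acks).
val socketLines = ssc.socketTextStream("localhost", 9999)

// Advanced source: lives in the separate spark-streaming-kafka artifact
// and uses a reliable, ack-aware receiver.
val kafkaParams = Map("zookeeper.connect" -> "localhost:2181", "group.id" -> "demo")
val topics      = Map("raw_weather" -> 1)   // topic -> number of receiver threads
val kafkaLines  = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)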
Spark Streaming External Source/Sink
Streaming Window Operations
kvStream
  .map { case (k, v) => (k, v.value) }
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b,
    Seconds(30), Seconds(10))   // 30 second window, sliding every 10 seconds
  .saveToCassandra(keyspace, table)
(Diagram: a Cassandra cluster spanning two data centers, US-East and Europe, each with ReplicationFactor=3.)
• Bursty traffic
• Volume of data from sensors requires a very large trigger and data acquisition system
• 30,000 applications on 2,000 nodes
Genetics / Biological Computations
IoT
CQL - Easy

CREATE TABLE users (
  username varchar,
  firstname varchar,
  lastname varchar,
  email list<varchar>,
  password varchar,
  created_date timestamp,
  PRIMARY KEY (username)
);

• Familiar syntax
• Many Tools & Drivers
• Many Languages
• Friendly to programmers
• Paxos for locking
Cassandra will automatically sort by most recent for both write and read
A record of every event, in the order in which it happened, per URL:
streamingContext.union(multipleStreams)
  .map { httpRequest => TimelineRequestEvent(httpRequest) }
  .saveToCassandra("requests_ks", "timeline")
Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
•NOSQL JOINS! (see the sketch after this list)
•Write & Read data between Spark and Cassandra
•Compatible with Spark 1.3
•Handles Data Locality for Speed
•Implicit type conversions
•Server-Side Filtering - SELECT, WHERE, etc.
•Natural Timeseries Integration
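On the "NOSQL JOINS" bullet: the connector exposes joinWithCassandraTable. A hedged sketch, assuming an existing `sc` (the keyspace, table and key names are made up, not the deck's schema):

import com.datastax.spark.connector._

// Keys we want to look up, held in a plain RDD.
case class StationKey(wsid: String)
val stations = sc.parallelize(Seq(StationKey("724940:23234"), StationKey("725030:14732")))

// Joins each key against matching rows in Cassandra without a full table scan;
// each lookup is pushed down to the partition that owns the key.
val joined = stations
  .joinWithCassandraTable("weather_ks", "raw_weather_data")
  .on(SomeColumns("wsid"))

joined.collect().foreach { case (key, row) => println(s"${key.wsid} -> $row") }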
Spark Cassandra Connector
(Diagram: each Spark Executor runs the User Application, the Spark-Cassandra Connector and the C* Driver, which talks to the Cassandra cluster nodes.)
Writing and Reading
SparkContext:
import com.datastax.spark.connector._

StreamingContext:
import com.datastax.spark.connector.streaming._
Write from Spark to Cassandra
// SparkContext                              Keyspace    Table
sc.parallelize(Seq(0, 1, 2)).saveToCassandra("keyspace", "raw_data")

predictionsRdd.join(music).saveToCassandra("music", "predictions")
Read From C* to Spark
CassandraRDD[CassandraRow]
Every Spark task uses a CQL-like query to fetch data for the given token range:
(Diagram: each Spark Executor queries only its assigned token ranges, e.g. 350 - 749 and 750 - 99.)
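A hedged sketch of the read side, assuming an existing `sc` (the keyspace, table and column names are illustrative): reads come back as a CassandraRDD[CassandraRow], and select/where are pushed down to Cassandra.

import com.datastax.spark.connector._

// Untyped rows: CassandraRDD[CassandraRow]
val rows = sc.cassandraTable("weather_ks", "raw_weather_data")
  .select("wsid", "temperature")
  .where("wsid = ?", "725030:14732")

rows.map(_.getDouble("temperature")).mean()

// Or map rows directly to a case class.
case class RawWeather(wsid: String, temperature: Double)
val typed = sc.cassandraTable[RawWeather]("weather_ks", "raw_weather_data")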
Spark SQL with Cassandra
import org.apache.spark.sql.cassandra.CassandraSQLContext

// write
sql.jsonRDD(json)
  .map(CommitStats(_))
  .flatMap(compute)
  .saveToCassandra("stats", "monthly_commits")

// read
val rdd = sc.cassandraTable[MonthlyCommits]("stats", "monthly_commits")
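The imported CassandraSQLContext can also query the same tables with SQL; a minimal sketch, assuming the `monthly_commits` table has a `year` column:

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)

// Keyspace-qualified table names resolve directly against Cassandra.
val commits = cc.sql("SELECT * FROM stats.monthly_commits WHERE year = 2015")
commits.collect().foreach(println)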
Spark Streaming, Kafka, C* and JSON
KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
  .map { case (_, json) => JsonParser.parse(json).extract[MonthlyCommits] }
  .saveToCassandra("github_stats", "commits_aggr")

sparkConf.set("spark.cassandra.connection.host", "10.20.3.45")
val streamingContext = new StreamingContext(conf, Seconds(30))
Spark Streaming ML, Kafka & C*
val ssc = new StreamingContext(new SparkConf()…, Seconds(5))

trainingStream.saveToCassandra("ml_keyspace", "raw_training_data")

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))

model.trainOn(trainingStream)

// Making predictions on testData
model
  .predictOnValues(testData.map(lp => (lp.label, lp.features)))
  .saveToCassandra("ml_keyspace", "predictions")
KillrWeather
• Global sensors & satellites collect data
• Cassandra stores in sequence
• Application reads in sequence
Data model should look like your queries
Queries I Need
• Get data by ID
• Get data for a single date and time
• Get data for a window of time
• Compute, store and retrieve daily, monthly, annual aggregations
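A hedged sketch of how those queries map onto pushed-down connector reads, assuming an existing `sc` (the keyspace, table and column names are illustrative, not the actual KillrWeather schema):

import com.datastax.spark.connector._

val wsid = "725030:14732"   // hypothetical station id

// Get data by ID (partition key only).
val byId = sc.cassandraTable("weather_ks", "raw_data").where("wsid = ?", wsid)

// Get data for a single date and time (partition key + clustering columns).
val oneHour = sc.cassandraTable("weather_ks", "raw_data")
  .where("wsid = ? AND year = ? AND month = ? AND day = ? AND hour = ?",
         wsid, 2015, 6, 9, 12)

// Get data for a window of time (range over the clustering columns).
val oneMonth = sc.cassandraTable("weather_ks", "raw_data")
  .where("wsid = ? AND year = ? AND month = ?", wsid, 2015, 6)

// Daily / monthly / annual aggregations are computed in Spark, written to their
// own pre-aggregated tables, and then read back by key just like the above.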
cluster.joinSeedNodes(Vector(..))
context.actorOf(BalancingPool(PoolSize).props(Props(
new KafkaPublisherActor(KafkaHosts, KafkaBatchSendSize))))
Cluster(context.system) registerOnMemberUp {
context.actorOf(BalancingPool(PoolSize).props(Props(
new HttpReceiverActor(KafkaHosts, KafkaBatchSendSize))))
}
/** Now the [[StreamingContext]] can be started. */
context.parent ! OutputStreamInitialized
def receive : Actor.Receive = {…}
}
Gets the partition key for Data Locality: the Spark C* Connector feeds this to Spark.
A Cassandra counter column in our schema means no expensive `reduceByKey` is needed. Simply let C* do it: not expensive, and fast.
/** For a given weather station, calculates annual cumulative precip - or year to date. */
class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {
def receive : Actor.Receive = {
case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender)
case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
}
/** Computes annual aggregation. Precipitation values are 1-hour deltas from the previous. */
def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
ssc.cassandraTable[Double](keyspace, dailytable)
.select("precipitation")
.where("wsid = ? AND year = ?", wsid, year)
.collectAsync()
.map(AnnualPrecipitation(_, wsid, year)) pipeTo requester
/** Returns the `k` highest precipitation values for the given station in the `year`. */
def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
ssc.sparkContext.parallelize(aggregate).top(k).toSeq)
ssc.cassandraTable[Double](keyspace, dailytable)
.select("precipitation")
.where("wsid = ? AND year = ?", wsid, year)
.collectAsync().map(toTopK) pipeTo requester
}
}
Efficient Batch Analytics
class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
extends AggregationActor {
import akka.pattern.pipe
def receive: Actor.Receive = {
case e: GetMonthlyHiLowTemperature => highLow(e, sender)
}
def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
.where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
.collectAsync()
.map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
}
C* data is automatically sorted by most recent - due to our data model.
Additional Spark or collection sort not needed.
@helenaedelson
github.com/helena
slideshare.net/helenaedelson
Learn More Online and at Cassandra Summit
https://academy.datastax.com/