Spark SQL
Spark SQL is a component on top of Spark Core that introduced a data abstraction called
DataFrames,[a] which provides support for structured and semi-structured data. Spark SQL
provides a domain-specific language (DSL) for manipulating DataFrames in Scala, Java, Python
or .NET.[16] It also provides SQL language support, with command-line interfaces and an
ODBC/JDBC server. Although DataFrames lack the compile-time type-checking afforded by
RDDs, as of Spark 2.0 the strongly typed Dataset is fully supported by Spark SQL as well.
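For example, the following Scala snippet creates a Spark session and loads a database table
into a DataFrame over JDBC (the connection URL is a placeholder):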
import org.apache.spark.sql.SparkSession

val url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword" // placeholder URL for your database server
val spark = SparkSession.builder().getOrCreate() // create a Spark session object

val df = spark
  .read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "people")
  .load()

df.createOrReplaceTempView("people") // register the DataFrame so it can be queried with SQL
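With the view registered, the same data can be queried either through the DataFrame DSL or
with SQL. A minimal sketch, assuming the people table has an age column:

// DataFrame DSL: filter and aggregate without writing SQL
val adultsByAge = df.filter(df("age") > 21).groupBy("age").count()

// Equivalent query in SQL over the registered temporary view
val adultsByAgeSql = spark.sql("SELECT age, COUNT(*) FROM people WHERE age > 21 GROUP BY age")

adultsByAge.show()    // both forms compile to the same optimized query plan
adultsByAgeSql.show()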
Spark Streaming
Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD transformations on those mini-batches of
data. This design enables the same set of application code written for batch analytics to be
used in streaming analytics, thus facilitating easy implementation of lambda
architecture.[19][20] However, this convenience comes with the penalty of latency equal to
the mini-batch duration. Other streaming data engines that process event by event rather than
in mini-batches include Storm and the streaming component of Flink.[21] Spark Streaming has
built-in support for consuming from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP
sockets.[22]
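As a sketch of the mini-batch model, the following Scala example (host and port are
placeholders) counts words arriving on a TCP socket, with each one-second batch processed
as an RDD:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// one-second mini-batches: each batch of input becomes an RDD of lines
val conf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999) // placeholder host and port
val wordCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // an ordinary RDD transformation, applied per mini-batch

wordCounts.print()
ssc.start()            // begin ingesting and processing batches
ssc.awaitTermination()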
In Spark 2.x, Structured Streaming, a separate technology based on Datasets with a
higher-level interface, is also provided to support streaming.[23]
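A minimal sketch of that higher-level interface, again assuming a socket source on a
placeholder host and port: the stream is expressed as an unbounded Dataset and queried with
the same DataFrame operations used for batch data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

// the stream is an unbounded DataFrame; new rows are appended as data arrives
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost") // placeholder host and port
  .option("port", 9999)
  .load()

// the same Dataset/DataFrame operations as in batch code
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete") // re-emit the full updated counts table on each trigger
  .format("console")
  .start()

query.awaitTermination()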
Spark can be deployed in a traditional on-premises data center as well as in the cloud.[24]