
Spark SQL

Spark SQL is a component on top of Spark Core that introduced a data abstraction called
DataFrames,[a] which provides support for structured and semi-structured data. Spark SQL
provides a domain-specific language (DSL) to manipulate DataFrames in Scala, Java, Python
or .NET.[16] It also provides SQL language support, with command-line interfaces and
ODBC/JDBC server. Although DataFrames lack the compile-time type-checking afforded by
RDDs, as of Spark 2.0, the strongly typed Dataset is fully supported by Spark SQL as well.

import org.apache.spark.sql.SparkSession

val url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername&password=yourPassword" // URL for your database server.
val spark = SparkSession.builder().getOrCreate() // Create a Spark session object

val df = spark
  .read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "people")
  .load()

df.printSchema() // Looks at the schema of this DataFrame.
val countsByAge = df.groupBy("age").count() // Counts people by age

// Or, alternatively, via SQL:
// df.createOrReplaceTempView("people")
// val countsByAge = spark.sql("SELECT age, count(*) FROM people GROUP BY age")
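Since Spark 2.0, the same table can also be handled through the strongly typed Dataset API mentioned above. The following is a minimal sketch, assuming the "people" table read into df has name and age columns; the Person case class is a hypothetical illustration, not part of the original example.

case class Person(name: String, age: Long) // Hypothetical schema assumed for the "people" table

import spark.implicits._ // Brings the encoders needed by .as[Person] into scope

val people = df.as[Person] // Converts the untyped DataFrame into a Dataset[Person]
val adults = people.filter(p => p.age >= 18) // Field access on Person is checked at compile time

A mismatch between the table's schema and the case class is caught when the Dataset is analyzed, rather than surfacing at run time deep inside a query.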

Spark Streaming

Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, thus facilitating easy implementation of lambda architecture.[19][20] However, this convenience comes with the penalty of latency equal to the mini-batch duration. Other streaming data engines that process event by event rather than in mini-batches include Storm and the streaming component of Flink.[21] Spark Streaming has built-in support for consuming from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.[22]
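As a minimal sketch of the mini-batch model described above, the following word count consumes lines from a TCP/IP socket; the application name, host, and port are placeholder assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]") // Local mode with two threads, assumed for running outside a cluster
val ssc = new StreamingContext(conf, Seconds(1)) // One-second mini-batch interval

val lines = ssc.socketTextStream("localhost", 9999) // Placeholder host and port

// The same RDD-style transformations used in batch code run on each mini-batch.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start() // Start ingesting and processing mini-batches
ssc.awaitTermination() // Block until the streaming job is stopped

Every result is delayed by at least the one-second batch interval, which is the latency penalty described above.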

In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, is also provided to support streaming; it offers a higher-level interface.[23]
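For comparison, a minimal sketch of the same word count expressed with Structured Streaming follows; the socket source, host, and port are again placeholder assumptions.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost") // Placeholder host
  .option("port", 9999) // Placeholder port
  .load()

// The stream is treated as an unbounded table and queried with the Dataset API.
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete") // Emit the full updated counts after each trigger
  .format("console")
  .start()

query.awaitTermination()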

Spark can be deployed in a traditional on-premises data center as well as in the cloud.[24]
