SQL in Big Data
Michell Queiroz (University of Antwerp) and two co-authors
ABSTRACT
The Structured Query Language (SQL) is the main programming language designed to manage data stored in database systems. While SQL was initially used only with relational database management systems (RDBMSs), its use has been significantly extended with the advent of new types of database systems. Specifically, SQL has been found to be a powerful query language in highly distributed and scalable systems that process Big Data, i.e., datasets with high volume, velocity and variety. While traditional relational databases now represent only a small fraction of the database systems landscape, most database courses that cover SQL consider only the use of SQL in the context of traditional relational systems. In this paper, we propose teaching SQL as a general language that can be used in a broad range of database systems, from traditional RDBMSs to Big Data systems. This paper presents well-structured guidelines to introduce SQL in the context of new types of database systems including MapReduce, NoSQL and NewSQL. A key contribution of this paper is the description of an array of course resources, e.g., virtual machines, sample projects, and in-class exercises, to enable a hands-on experience with SQL across a broad set of modern database systems.

Categories and Subject Descriptors
K.3.2 [Computers and Education]: Computer and Information Science Education

General Terms
Design, Experimentation

Keywords
Databases curricula; SQL; structured query language; Big Data

1. INTRODUCTION
The Structured Query Language (SQL) is the most extensively used database language. SQL is composed of a data definition language (DDL), which allows the specification of database schemas; a data manipulation language (DML), which supports operations to retrieve, store, modify and delete data; and a data control language (DCL), which enables database administrators to configure security access to databases. Among the most important reasons for SQL's wide adoption are that: (1) it is primarily a declarative language, that is, it specifies the logic of a program (what needs to be done) instead of the control flow (how to do it); (2) it is relatively simple to learn and understand because it is declarative and uses English statements; (3) it is a standard of the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO); and (4) it is, to some extent, portable among different database systems. Even though SQL was initially proposed for traditional relational databases, it has also been found to be an effective language in several new types of database systems, particularly Big Data management systems (BDMSs) that process large, fast and diverse data. While BDMSs have significantly changed the database landscape and traditional RDBMSs now represent only a small fraction of it, most books and database courses only present SQL in the context of RDBMSs.

In this paper we propose learning SQL as a powerful language that can be used to query a wide range of database systems, from traditional relational databases to modern Big Data systems. The main contributions of this paper are:

- The description of the new database landscape and the identification of broad types of Big Data systems where SQL can be effectively used.
- The presentation of well-structured guidelines to prepare course units that focus on the study of SQL on new types of database systems including MapReduce, NoSQL and NewSQL.
- A detailed description of class resources for each unit to enable a hands-on experience with SQL across a broad set of modern database systems. These resources include virtual machines, data generators, sample programs, projects, and in-class exercises.
- Files of all the resources made available to instructors [1]. The goal is to enable instructors to use and extend these resources based on their specific needs.

The rest of the paper is organized as follows. Section 2 describes the new database landscape and the types of BDMSs where SQL can be used. Sections 3, 4, and 5 present the details of how SQL can be introduced in the context of the new types of database systems (MapReduce, NoSQL, and NewSQL). These sections describe many class resources which will be made available in [1]. Section 6 presents a discussion of the integration of the proposed units into database courses. Section 7 concludes the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGCSE'16, March 2–5, 2016, Memphis, Tennessee, USA.
Copyright 2016 ACM 1-58113-000-0/00/0010 …$15.00.

2. NEW DATABASE LANDSCAPE
The development and widespread use of highly distributed and scalable systems to process Big Data is considered one of the key recent technological developments in computer science [3, 4, 5, 6]. These systems have significantly extended the database landscape and are currently used in many application scenarios, e.g., social media data mining, scientific data analysis, recommendation systems, and web service analysis. BDMSs are usually composed of clusters of commodity machines and are often dynamically scalable, i.e., nodes can be added or removed as needed. The area of BDMSs has quickly developed and there are now dozens of BDMS-related open-source and commercial products. Some of these systems have proposed their own query languages or application program interfaces but several others have recognized the benefits of using SQL and support it as a query language. The use of SQL in BDMSs is expanding and some top
database researchers believe that many more systems will move to support SQL [2]. The authors propose that SQL should be taught in the context of the following types of BDMSs in addition to traditional RDBMSs:

- MapReduce. It is considered one of the main frameworks for Big Data processing. It enables building highly distributed programs that run on failure-tolerant and scalable clusters. Apache Hadoop [7] is the most widely used MapReduce implementation. Examples of systems that support SQL to query data in Hadoop are: Hive [8] and Spark (Spark SQL) [9].
- NoSQL. These data stores have been designed to provide higher scalability and availability than conventional relational databases while supporting a simplified transaction and consistency model. Some examples of NoSQL data stores are: MongoDB [10], Apache HBase [11], Google's BigTable [12], and Apache Cassandra [13]. While many NoSQL systems do not support SQL natively, several systems have been proposed to enable SQL querying on these systems. Examples of such systems are: Impala [14], Presto [15] and SlamData [16].
- NewSQL. These systems aim to have the same levels of scalability and availability of NoSQL systems while maintaining the ACID properties (Atomicity, Consistency, Isolation and Durability), relational schema, and SQL query language of traditional relational databases. Examples of NewSQL systems are: VoltDB [17], MemSQL [18], NuoDB [19], and Clustrix [20].

3. SQL IN MAPREDUCE SYSTEMS
MapReduce is an extensively used programming framework for processing very large datasets. Apache Hadoop is its most popular implementation. A MapReduce program divides a large dataset into independent chunks that can be processed in a parallel fashion over dynamic computer clusters. The overall processing is divided into two main phases, map and reduce, and the framework user needs to provide the code for each phase. The map and reduce functions have the following form:

map: (k1,v1) → list(k2,v2)
reduce: (k2,list(v2)) → list(k3,v3)

The input dataset is usually stored on an underlying distributed file system and different chunks of this dataset are processed in parallel by different map tasks. Each map task processes an input chunk one record or line at a time. Each map function call receives a key-value pair (k1,v1) and generates a list of (k2,v2) pairs. Each generated key-value pair is sent over the network to a reduce node (shuffle phase). The framework guarantees that all the intermediate pairs with the same key (k2) will be sent to the same reduce node, where they will form a single group. Each reduce call processes a group (k2,list(v2)) and outputs a list of (k3,v3) pairs, which represent the overall output of the MapReduce job.

While MapReduce is a powerful framework to build highly distributed and scalable programs, it is also complex and difficult to learn. In fact, even simple data operations, like joining two datasets or identifying the top-K records, require relatively complex MapReduce programs. This is the case because MapReduce requires users to build a program using a procedural language that needs a detailed specification of how a processing task should be carried out. Considering this limitation, several systems have been proposed to: (1) enable the use of SQL-like languages on top of MapReduce-based systems, e.g., Apache Hive [8] and Apache Pig [22], and (2) integrate SQL with MapReduce-based computations, e.g., Spark SQL [9].

3.1 Hive: SQL Queries on Hadoop
Apache Hive [8] is a system that supports the processing and analysis of data stored in Hadoop. Hive allows projecting structure onto this data and querying it using HiveQL, a SQL-like query language. One of the key features of Hive is that it transparently converts queries specified in HiveQL to MapReduce programs. This enables the user to focus on specifying what the query should retrieve instead of specifying a procedural MapReduce program in Java. Hive uses indexing structures to accelerate the execution of queries and was particularly built to support data warehouse applications, which require the analysis of primarily read-only massive datasets (data that does not change over time). While HiveQL is Hive's main query language, Hive also allows the use of custom map and reduce functions when this is a more convenient or efficient way to express a given query logic.

HiveQL supports many of the features of SQL but does not strictly follow a full SQL standard. Hive supports multiple DDL and DML commands such as CREATE TABLE, SELECT, INSERT, UPDATE and DELETE. Moreover, starting with Hive 0.13, it is possible to support transactions with full ACID semantics at the row (record) level.

We propose the following in-class activity to introduce HiveQL.

Using Virtual Machines. Since many Big Data systems rely on distributed architectures, a computer cluster is needed to enable direct interaction with these technologies. Many institutions, however, do not have such clusters available for teaching. A solution that the authors recommend is the use of virtual machines (VMs) with all the required packages already installed and configured. In the case of Hive, the authors recommend the use of Cloudera's VM [23], which includes: (1) CentOS Linux as the operating system, (2) Hadoop, (3) Hive, and (4) Hue, a web-based application that can be used to write and run HiveQL queries.

Data Generators and Datasets. The queries presented in this section use MStation2, a synthetic dataset of meteorological station data prepared by the authors. The dataset has two tables: Station (stationID, zipcode, latitude, longitude) and WeatherReport (stationID, temperature, precipitation, humidity, year, month). The data generator and a sample dataset are available in [1]. The generator can be modified to produce datasets of different sizes and data distributions for additional exercises or projects.

Loading and Querying the Data. The commands for this activity are listed in Fig. 1. To start interacting with Hive, students can open a terminal window in Cloudera's VM and start the Hive console using command C1. Students can then create the database MStation (C2) and use the DDL commands C3 and C4 to create tables stationData and weatherReport, respectively. Students can then load the data generated by the MStation2 data generator into both tables using commands C5 and C6. Both the Hive console and the Hue application can be used next to run SQL queries. Students can use the available tables to write queries such as (Q1-Q4 in Fig. 1):

Q1: For each zipcode, compute the average precipitation level.
Q2: Group the data by station and output the average humidity level at each station.
Q3: Identify the minimum and maximum temperatures reported by each station considering years greater than 2000.
Q4: Compute the ten highest temperatures ever reported. List the temperature, zip code, month and year.

Fig. 2 shows the result of queries using the Hive console and Hue.
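To help students connect the declarative queries to the computation Hive performs, instructors may also show a plain-Python equivalent of Q1 (the join-and-average query). The sketch below is hypothetical and framework-free, using a few hand-made rows that follow the MStation2 schema; it is not part of the published course materials.

```python
from collections import defaultdict

# Tiny hand-made samples following the MStation2 schema:
# station rows are (stationid, zipcode); weather rows are (stationid, precip).
station_data = [(1, 85281), (2, 85281), (3, 10001)]
weather_report = [(1, 2.0), (1, 4.0), (2, 6.0), (3, 1.0)]

# Q1: join on stationid, then average precipitation per zipcode.
zip_of = {sid: z for sid, z in station_data}
totals = defaultdict(lambda: [0.0, 0])          # zipcode -> [sum, count]
for sid, precip in weather_report:
    acc = totals[zip_of[sid]]
    acc[0] += precip
    acc[1] += 1
avg_precip = {z: s / c for z, (s, c) in totals.items()}
print(avg_precip)  # {85281: 4.0, 10001: 1.0}
```

This makes explicit the two steps that HiveQL hides: the join becomes a dictionary lookup, and GROUP BY ... AVG becomes accumulation of a running sum and count per group.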
C1: sudo hive
C2: CREATE database MStation;
C3: CREATE TABLE stationData (stationid int, zipcode int, latitude double, longitude double, stationname string);
C4: CREATE TABLE weatherReport (stationid int, temp double, humi double, precip double, year int, month string);
C5: LOAD DATA LOCAL INPATH '/home/cloudera/datastation/stationData.txt' INTO TABLE stationData;
C6: LOAD DATA LOCAL INPATH '/home/cloudera/datastation/weatherReport.txt' INTO TABLE weatherReport;
Q1: SELECT S.zipcode, AVG(W.precip) FROM stationData S JOIN weatherReport W ON S.stationid = W.stationid GROUP BY S.zipcode;
Q2: SELECT stationid, AVG(humi) FROM weatherReport GROUP BY stationid;
Q3: SELECT stationid, MIN(temp), MAX(temp) FROM weatherReport WHERE year > 2000 GROUP BY stationid;
Q4: SELECT S.zipcode, W.temp, W.month, W.year FROM stationData S JOIN weatherReport W ON S.stationid = W.stationid ORDER BY W.temp DESC LIMIT 10;

Figure 1. Hive Commands and Queries

C1: wget http://real-chart.finance.yahoo.com/table.csv?s=AAPL&d=6&e=4&f=2015&g=d&a=11&b=12&c=1980&ignore=.csv
C2: mv table.csv?s=AAPL table.csv
C3: hadoop fs -put ./table.csv /data/
C4: spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
C5: import org.apache.spark.sql._
C6: val base_data = sc.textFile("hdfs://sandbox.hortonworks.com:8020/data/table.csv")
C7: val attributes = base_data.first
C8: val data = base_data.filter(_(0) != attributes(0))
C9: case class AppleStockRecord(date: String, open: Float, high: Float, low: Float, close: Float, volume: Integer, adjClose: Float, year: Int)
C10: val applestock = data.map(_.split(",")).map(row => AppleStockRecord(row(0), row(1).trim.toFloat, row(2).trim.toFloat, row(3).trim.toFloat, row(4).trim.toFloat, row(5).trim.toInt, row(6).trim.toFloat, row(0).trim.substring(0,4).toInt)).toDF()
C11: applestock.registerTempTable("applestock")
C12: applestock.show
C13: applestock.count
C14: output.map(t => "Record: " + t.toString).collect().foreach(println)
Q1: val output = sql("SELECT * FROM applestock WHERE close >= open")
Q2: val output = sql("SELECT MAX(close-open) FROM applestock")
Q3: val output = sql("SELECT date, high FROM applestock ORDER BY high DESC LIMIT 10")
Q4: val output = sql("SELECT year, AVG(Volume) FROM applestock WHERE year > 1999 GROUP BY year")

Figure. Spark SQL Commands and Queries