Overview of Big SQL 3.0 Functionality
Maria Schwenger
IBM
Big Data is derived from multiple sources. It involves not just traditional relational data, but also many kinds
of unstructured data sources that are growing at a significant rate. For instance, machine-generated data multiplies
quickly and contains rich, diverse content that needs to be discovered. Human-generated data from social media, as
another example, is more textual, but the valuable insights are often buried under many possible meanings. Extracting
meaningful insights from such data sources quickly and efficiently is challenging.[6]
Many people believe that Hadoop technology is the best answer to those challenges. Hadoop is an open
source platform, which many vendors extend in search of robust additional functionality.[4] IBM InfoSphere®
BigInsights™ (BigInsights) is IBM's approach to enriching Hadoop with enterprise-class capabilities. The latest release
of BigInsights, 3.0, provides a single platform to manage all data types and to obtain insights that integrate the usual
silos of data. With such integration, BigInsights is a viable platform for the storage, analysis, and visualization of a
wide variety of data sources.[5] Several new features of BigInsights 3.0 concentrate on integration and workload
optimizations that bring improvements in storing, processing, and querying data. Big SQL 3.0 is one of those
capabilities. (See Figure 1)
Big SQL 3.0 achieves its capabilities by leveraging IBM's leading relational SQL expertise. It introduces
an IBM SQL compiler and runtime engine built on IBM's experience with shared-nothing, low-latency, parallel-processing
database architectures. Big SQL 3.0 has the advantage of deploying directly to the Hadoop cluster and accessing data
natively, while still allowing the data to remain in its original state on the Hadoop Distributed File System (HDFS).[4]
It provides the ability to access custom data storage formats through a Java I/O module, and it provides a logical view
of the data in which users rely on the Hive database catalog (HCatalog) for table definitions, metadata, location,
storage format, and encoding of input files.
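For example, because the data can remain in place on HDFS, an existing directory of files can be exposed to SQL without being moved. The following is a minimal, hypothetical sketch (the table name, columns, and path are placeholders) of a Big SQL external table over a comma-delimited data set:

    -- Expose an existing HDFS directory as a query-able table; the files are
    -- read in place rather than copied into the warehouse directory.
    CREATE EXTERNAL HADOOP TABLE weblogs (
        log_ts   TIMESTAMP,
        user_id  INT,
        url      VARCHAR(200)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/weblogs';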
There are four main Big SQL 3.0 functionality pillars built and run on the Hadoop cluster: rich SQL
language support, performance, application portability and integration, and enterprise capabilities.[1]
Let's briefly review these four pillars, grouped by functional area.
As the name suggests, the first focus of Big SQL 3.0 is the rich SQL support that leverages IBM's SQL
capability layer to provide comprehensive ANSI SQL coverage. It stretches over a wide variety of standard relational
SQL functionality, providing support for: a variety of data types (tinyint, smallint, int, bigint, boolean, float,
double, string, varchar, and others); standard APIs and common client drivers (Java Database Connectivity (JDBC)
and Open Database Connectivity (ODBC)); and procedural language constructs such as those in DB2 SQL PL or user-defined
functions in Java or C++. It has built-in statistical, analytical, and business intelligence functions that allow
existing SQL queries to be reused with very few modifications. It is designed for large-scale analytic aggregation and
supports the main OLAP functions, such as LEAD, LAG, RANK, DENSE_RANK, CORRELATION, STDDEV, REGR_AVGX,
and others. Big SQL 3.0 also comes equipped with an extensive library of more than 250 built-in functions and 25
aggregates.
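As a brief, hedged illustration of this OLAP support (the table and column names below are hypothetical), the query ranks products by revenue within each region and compares each month's revenue with the previous month's:

    -- RANK and LAG window functions over hypothetical sales data.
    SELECT region,
           product,
           revenue,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank,
           LAG(revenue) OVER (PARTITION BY region, product
                              ORDER BY sales_month) AS prior_month_revenue
    FROM sales;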
Big SQL 3.0 is a simple but comprehensive way to apply SQL to existing data processed by Hadoop. It
provides a logical view of the data that uses the Hive catalog (HCatalog) for table definitions, metadata, location,
storage format, and encoding of input files. When a table is created, its definition is stored in HCatalog through the
Hive APIs, and its data is placed in the HDFS warehouse directory within the designated schema. Table definitions in
the Hive metastore can be both read and written, and they are shared across the Hadoop ecosystem: they can be accessed
by other components (such as Pig, Sqoop, etc.) throughout the Hadoop cluster, as the sketch below illustrates.[2] To
facilitate query execution, HCatalog stores some of its metadata locally on each node. Based on this, when running in
parallel processing mode, the server connects to each specific node that contains the partitioned query-able data and
generates a result set from the data stored on that node. Query execution also stores intermediate data in the Big SQL
3.0 engine memory for any future or additional processing. (See Figure 2)
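To make the shared-catalog point concrete, the hypothetical sketch below (all names are placeholders) defines a table through Big SQL; because the definition lands in the Hive metastore, other metastore-aware components can then see and query the same table:

    -- Defined through Big SQL, but catalogued in HCatalog/the Hive metastore,
    -- so tools such as Hive or Pig can subsequently reference the same table.
    CREATE HADOOP TABLE reviews (
        product_id INT,
        rating     INT,
        review     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE;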
Although the rich SQL support is the essence of Big SQL 3.0, there are other very important features
that complement and raise the value of the SQL functionality.
The second functionality pillar is application portability and integration within BigInsights Big SQL 3.0.
This feature introduces portability that allows data sharing between the Hadoop ecosystem and other SQL querying
products, as well as the use of the SQL language instead of MapReduce/Java syntax. In some ways, this
integration is the ultimate driving force behind the Big SQL 3.0 feature. Big SQL 3.0 integrates fully with third-party
SQL application tools that leverage common client drivers such as JDBC and ODBC. The benefit is that existing queries
can be run with little to no modification. Big SQL 3.0 tables can be accessed using the DB2 data server client as well
as the Big SQL JDBC driver. This is the same client that is used to access DB2 on Linux, UNIX, and Windows, which makes
it possible to easily access Big SQL tables from many popular APIs, including Perl, Python, and Ruby. The Big SQL node
and database are cataloged using the same commands as for a DB2 LUW database, and the CONNECT statement and almost all
SQL statements are consistent with DB2 LUW. For example, a query that accesses a table in Big SQL on a Hadoop cluster
can be run directly from the DB2 client command window. Also available is advanced query federation over query-able
archives, which allows data to be combined; the data does not need to live in isolation on a single system. This
additional integration capability allows a distributed request to be sent to multiple data sources with a single SQL
statement, as in the sketch below. These data sources include different relational database management systems such as
DB2, Teradata, Oracle®, Netezza®, and others.
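As a hedged sketch of what such a federated request can look like (the server, nickname, and table names are hypothetical, and the exact setup DDL varies by data source), a single statement can join a Hadoop-resident table with a table living in a remote relational system:

    -- Assuming a federated server definition for a remote Oracle system has been
    -- registered, a nickname makes the remote table addressable as if it were local:
    --   CREATE NICKNAME ora_customers FOR orasrv.sales.customers;

    -- One SQL statement then combines HDFS-resident and remote data:
    SELECT c.customer_name,
           SUM(o.amount) AS total_amount
    FROM   orders o           -- Big SQL table stored on HDFS
    JOIN   ora_customers c    -- nickname over the remote Oracle table
           ON o.customer_id = c.customer_id
    GROUP BY c.customer_name;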
Big SQL 3.0's third functionality pillar is performance, which also facilitates resource monitoring and software
management. The performance tooling is built from the ground up for low-latency, optimized query execution. Its
operations run in memory, and its daemons live directly on the Hadoop cluster.[1] As a result, there is less overhead
and reduced latency when retrieving data, and there is no need to initiate MapReduce jobs on the cluster where the
data is located.
Big SQL 3.0 query optimization takes the following forms: query rewrites, query plan optimization, predicate
pushdown, and statistics-driven optimization. Query rewriting restructures a query into a form that executes with
minimal resources. Query plan optimization includes many subquery optimizations; for instance, one of the main cases
is a join over data that is spread throughout the Hadoop cluster. Other rewrites are shared aggregation and distinct
elimination, which leverage key definitions for a cohesive rewrite. Predicate pushdown moves filters based on static
values down into the data read, so that less data is scanned. For example, Big SQL 3.0 distinct elimination rewrites a
query to remove the DISTINCT operation when the key values are already unique.[1] Finally, statistics-driven
optimization feeds column statistics to the query optimizer to determine an ideal query execution plan. These
transformations are based upon years of IBM's relational database experience and go well beyond simple statistics and
heuristics.
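As a small, hypothetical illustration of distinct elimination (assuming the id column is declared unique), the optimizer can treat the two statements below as equivalent and skip the costly duplicate-removal step:

    -- With a unique key on id, every (id, status) row is already unique.
    SELECT DISTINCT id, status FROM orders;  -- as written
    SELECT id, status FROM orders;           -- as executed after the rewrite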
The enterprise-level capabilities are the fourth pillar of Big SQL 3.0. These capabilities include
functionality such as monitoring, management, and security of the SQL engine running over the Hadoop system.
Resource, event, memory, and workload management enhance the Hadoop functionality for data stored in
BigInsights. The security capabilities include secure user authentication, authorization, auditing capabilities, and
fine-grained row/column access control. Another key component is the scheduler, which excludes any stored data that is
not relevant to a SQL query (known as partition elimination). It also ensures that data processing among the nodes is
done cohesively, as close to the data as possible.
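As a sketch of what the fine-grained access control can look like, the hypothetical example below uses DB2-style row and column access control (RCAC) syntax, which Big SQL's DB2-based security model is aligned with; the table, role, and column names are placeholders:

    -- Row permission: analysts see only rows for their own region (illustrative).
    CREATE PERMISSION region_filter ON sales
        FOR ROWS WHERE VERIFY_ROLE_FOR_USER(SESSION_USER, 'ANALYST') = 1
                       AND region = 'EAST'
        ENFORCED FOR ALL ACCESS
        ENABLE;

    -- Column mask: reveal only the last four digits of a card number to
    -- non-managers (illustrative).
    CREATE MASK ccn_mask ON sales FOR COLUMN credit_card_num
        RETURN CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER, 'MANAGER') = 1
                    THEN credit_card_num
                    ELSE 'XXXX-XXXX-XXXX-' || SUBSTR(credit_card_num, 16, 4)
               END
        ENABLE;

    -- The controls take effect once activated on the table:
    ALTER TABLE sales ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL;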
Fig. 2. Hadoop components interaction with Big SQL 3.0
This section describes how to get hands-on experience with Big SQL 3.0. IBM has provided an easy way
for you to get familiar with Big SQL 3.0: a cloud environment that allows you to get hands-on experience with this
component of BigInsights.
NOTE: On Windows platforms, the URLs can be accessed using Mozilla Firefox. On Mac platforms, use Safari.
SETUP
1. Go to http://bigsql.imdemocloud.com/ and submit the request form to gain access to the Big SQL 3.0 Technology
Preview.[7]
2. The IM Demo Cloud Team will email you instructions on how to set up your account on the cloud. Once you have
set up your account, go to http://my.imdemocloud.com and log in with your credentials.
3. After you log in, verify that you have access to the Big SQL 3.0 Technology Preview project. Under "Systems"
you will find information for the BigInsights web console and the hostname of the Big SQL 3.0 environment.
4. Launch the BigInsights web console by clicking "BigInsights Web Console" or by going to the URL
http://hostname:8080, and use your IM Demo Cloud credentials to log in. Then verify that the BigInsights
services for Big SQL 3.0 are up and running by clicking "Cluster Status".
BigInsights supports a command-line interface for Big SQL 3.0 called the Java SQL shell (JSqsh, pronounced
"J-skwish"). JSqsh is an open source project for querying JDBC databases through a shell-like environment.[3] To issue
Big SQL 3.0 commands from JSqsh, launch a terminal window (for instance, PuTTY).
5) During the initial launch, begin the JSqsh setup wizard by hitting Enter.
6) The wizard displays default values for the connection properties and prompts you to change them. Change each
variable, one at a time, by entering the variable's number and specifying its designated new value.
o For instance, to change the value of the password variable, type 5 (that variable's number) and hit
Enter. The wizard will prompt you for a new password; type the new value and hit Enter again.
o Enter s to save your connection information and specify bigsql as the connection name.
7) Once the Big SQL 3.0 connection is created, connect to the bigsql engine from the JSqsh terminal.
8) Once you are connected, verify that the JSqsh command prompt starts with 1>.
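At this point a quick smoke test is useful. The hedged example below assumes the DB2-compatible SYSCAT catalog views that Big SQL exposes (the same views used for the metadata query in the USAGE section):

    1> select tabschema, tabname from syscat.tables fetch first 5 rows only;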
USAGE
Now that you have created the bigsql connection, you can issue Big SQL 3.0 queries to become familiar with the JSqsh
shell.
1) Type \show tables -e | more to display information for all available tables.
2) To view metadata for a table, use the \describe command with the table name in uppercase:
\describe USER.DEMO
3) Another way to query the system for metadata is to run the following SQL command:
o select tabschema, colname, colno, typename, length from syscat.columns where tabschema = USER
and tabname = 'DEMO';
4) Use the Hadoop put command to copy your data file into HDFS: hadoop fs -put /local/file/system/f.txt
/hadoop/file/system/.
5) Create a Hadoop table, specifying the table schema, the delimiter value, and the table format (see the sketch
after these steps).
6) Next, load the data from HDFS into the newly created table:
load hadoop using file url '/hadoop/file/system/f.txt' with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE sample OVERWRITE;
7) Run a SELECT command to verify that the data was loaded into the table.
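As a minimal end-to-end sketch of steps 5 through 7 (the table name, column names, and file path below are hypothetical placeholders, and the DDL shows one common delimited-table form rather than the only possibility):

    -- Step 5: create a tab-delimited Hadoop table (schema is illustrative).
    CREATE HADOOP TABLE sample (
        id    INT,
        name  VARCHAR(50)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Step 6: load the tab-delimited file from HDFS into the table.
    LOAD HADOOP USING FILE URL '/hadoop/file/system/f.txt'
        WITH SOURCE PROPERTIES ('field.delimiter'='\t')
        INTO TABLE sample OVERWRITE;

    -- Step 7: verify the load.
    SELECT * FROM sample FETCH FIRST 10 ROWS ONLY;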
Now you are ready to create your own custom tables and test more complex queries.
Note: Within JSqsh, a query can be split across multiple lines; it will only execute once it ends with a
semicolon. To review the history of commands recently executed in JSqsh, type \history and hit Enter.
Conclusion
As we outlined above, BigInsights Big SQL 3.0 introduces rich SQL support in combination with
features that provide enterprise-level performance, management, and monitoring. IBM also provides an easy way for
you to explore this functionality. Give it a try and let us know what you think!
References
[1] Gary, Scott C., and Ozcan, Fatma. (April 2014). SQL-on-Hadoop without compromise: How Big SQL 3.0 from IBM
represents an important leap forward for speed, portability and robust functionality in SQL-on-Hadoop solutions.
IBM Software Group. IBM.com. Retrieved from:
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=SWW14019USEN#loaded
[2] IBM InfoSphere BigInsights Version 3.0: IBM® Big SQL. IBM Knowledge Center. IBM.com. Retrieved from:
http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_sql_access.html
[3] Saracco, Cindy. (July 2014). Big SQL on Hadoop Part 2: The Big SQL Command Line Interface. IBM
developerWorks. IBM.com. Retrieved from: https://developer.ibm.com/hadoop/docs/853/
[4] White, Tom. (2012). Hadoop: The Definitive Guide, 3rd Edition. Sebastopol, CA: O'Reilly Media Inc.
[5] InfoSphere BigInsights: Bringing the power of Hadoop to the Enterprise. IBM.com. Retrieved from:
www.ibm.com/software/data/infosphere/biginsights
[6] IBM Big Data and Information Management (now known as IBM Watson Foundations). IBM.com. Retrieved
from: http://www-01.ibm.com/software/data/bigdata/
[7] Katsnelson, Leon. (2013). Welcome to the BigInsights Quick Start on the IM Demo Cloud [Video]. United
States: YouTube.com.