Course 3 1 Big SQL
• Start and stop Big SQL using Ambari and the command line
Using Big SQL to access data residing in the HDFS © Copyright IBM Corporation 2018
Big SQL is SQL on Hadoop
SQL access for Hadoop: Why?
What does Big SQL provide?
Big SQL provides comprehensive, standard SQL
Big SQL provides powerful optimization and performance
• Message passing allows data to flow between nodes without persisting intermediate results
• In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
Big SQL supports a variety of storage formats
▪ DFS
▪ Hive
▪ HBase
Big SQL integrates with RDBMS
• The Big SQL LOAD command can load data from a remote database or table
• Query heterogeneous databases using federation feature
Big SQL architecture
The relationship between Big SQL and Db2
• Bug fixes and enhancements (especially in the optimizer) in Db2 also benefit Big SQL.
Starting and stopping Big SQL using Ambari
Starting and stopping Big SQL from the command line
$BIGSQL_HOME/bin/bigsql status
$BIGSQL_HOME/bin/bigsql stop
$BIGSQL_HOME/bin/bigsql start
Accessing Big SQL
JSqsh (1 of 3)
• Big SQL includes JSqsh (pronounced "jay-skwish"), the Java SQL Shell - a command-line interface
▪ Open source command client
▪ Query history and query recall
▪ Multiple result set display styles
▪ Multiple active sessions
JSqsh (2 of 3)
JSqsh (3 of 3)
Web tooling using Data Server Manager (DSM)
Connecting to Big SQL with Data Server Manager
Create a database connection to Big SQL within DSM
Checkpoint
2. List the two ways you can access and use Big SQL.
3. What command is used to start Big SQL from the command line?
Checkpoint solutions
Creating Big SQL schemas and tables
Big SQL terminology
• Warehouse
▪ Default directory in the HDFS where the tables are stored
▪ Defaults to /apps/hive/warehouse/
• Schema
▪ Tables are organized into schemas
▪ Defaults to /apps/hive/warehouse/bigsql.db
• Table
▪ A directory with zero or more data files
▪ Example: /apps/hive/warehouse/bigsql.db/test1
▪ Tables may be stored anywhere
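The mapping above can be illustrated with a short DDL sketch (assuming the default warehouse location and the default bigsql schema; the table and column names are illustrative):

```sql
-- With the defaults above, this table's data directory is
-- /apps/hive/warehouse/bigsql.db/test1
CREATE HADOOP TABLE test1 (
    c1 INT,
    c2 VARCHAR(20)
);
```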
Partitioned tables
• The partitioning columns are specified when the tables are created
• Query predicates can be used to eliminate the need to scan every partition
• Example:
▪ /apps/hive/warehouse/schema.db/tablename/col1=val1
▪ /apps/hive/warehouse/schema.db/tablename/col1=val2
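A sketch of the DDL behind the directory layout above (schema, table, and column types are illustrative; in Hive-style partitioning the partitioning column is declared in PARTITIONED BY, not in the column list):

```sql
CREATE HADOOP TABLE schema1.tablename (
    c2 INT,
    c3 VARCHAR(50)
)
PARTITIONED BY (col1 VARCHAR(10));  -- creates .../tablename/col1=val1 subdirectories

-- A predicate on the partitioning column lets Big SQL skip the
-- partitions (directories) that cannot match:
SELECT COUNT(*) FROM schema1.tablename WHERE col1 = 'val1';
```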
Creating Big SQL schemas
Using web GUI to browse the HDFS
Creating a Big SQL table
• Standard CREATE TABLE DDL with extensions
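As a sketch of those extensions (the file format and field delimiter here are illustrative choices, not requirements):

```sql
CREATE HADOOP TABLE users (   -- HADOOP keyword: Big SQL extension
    id   INT NOT NULL,
    name VARCHAR(100)
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','  -- extension: delimited text-file layout
STORED AS TEXTFILE;           -- extension: underlying storage format
```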
More about CREATE TABLE
• HADOOP keyword
▪ Must be specified unless you enable SYSHADOOP.COMPATIBILITY_MODE
• EXTERNAL keyword
▪ Indicates that the table is not managed by the database manager
▪ When the table is dropped, the definition is removed; the data remains unaffected
• LOCATION keyword
▪ Specifies the DFS directory to store the data files
• Example:
CREATE EXTERNAL HADOOP TABLE T1 (
  C1 INT NOT NULL PRIMARY KEY CHECK (C1 > 0),
  C2 VARCHAR(10) NULL,
  …
)
…
LOCATION '/user/myusername/tables/user'
CREATE TABLE - partitioned tables
Additional CREATE TABLE features
CREATE VIEW
• CREATE VIEW statement defines a view on one or more tables, views or nicknames
• Standard SQL syntax
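A minimal sketch, assuming a base table named sales.orders (both names illustrative):

```sql
-- Views use standard SQL; no Hadoop-specific keywords are needed
CREATE VIEW recent_orders AS
    SELECT order_id, amount
    FROM sales.orders
    WHERE order_date >= '2018-01-01';
```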
Loading data into Big SQL tables
Populating Big SQL tables via LOAD
• Load data from RDBMS (Db2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
Populating Big SQL tables via INSERT (1 of 2)
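For illustration (table and column names assumed), a set-oriented INSERT … SELECT. Note that each INSERT statement typically produces at least one new file in the table's HDFS directory, so many small single-row inserts are best avoided:

```sql
INSERT INTO sales.orders_2018
    SELECT order_id, amount
    FROM sales.orders
    WHERE order_date >= '2018-01-01';
```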
Populating Big SQL tables via CREATE … TABLE … AS SELECT …
• Source tables can be in different file formats or use different underlying storage mechanisms.
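A sketch of CREATE … AS SELECT converting between storage formats (table names and formats illustrative):

```sql
-- Copy a delimited text table into a new Parquet-backed table
CREATE HADOOP TABLE orders_parquet
    STORED AS PARQUET
AS SELECT * FROM orders_text;
```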
Data types
• Big SQL uses HCatalog (Hive Metastore) as its underlying data representation and access method
• SQL type
• Hive type
▪ This data type is defined in the Hive metastore for the table
▪ This type tells the SerDe how to encode/decode values for the type
▪ The Big SQL reader converts values in the Hive types to SQL values on read
More about . . . data types
• Variety of primitives supported
▪ TINYINT, INT, DECIMAL(p,s), FLOAT, REAL, CHAR, VARCHAR,
TIMESTAMP, DATE, VARBINARY, BINARY, . . .
▪ Maximum length of 32K for character types
• Complex types
▪ ARRAY: ordered collection of elements of same type
▪ Associative ARRAY (equivalent to Hive MAP type): unordered collection of key/value pairs; keys must be primitive types (consistent with Hive)
▪ ROW (equivalent to Hive STRUCT type): collection of elements of different types
▪ Nesting supported for array-of-rows and map-of-rows types
▪ Query predicates for ARRAY or ROW columns must specify elements of a primitive type
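The complex types above might be declared as follows (a sketch; the table and column names are illustrative):

```sql
CREATE HADOOP TABLE customers (
    id      INT,
    phones  VARCHAR(20) ARRAY[5],            -- ordered ARRAY
    attrs   VARCHAR(50) ARRAY[VARCHAR(20)],  -- associative ARRAY (Hive MAP)
    address ROW(street VARCHAR(50),          -- ROW (Hive STRUCT)
                city   VARCHAR(30))
);
```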
BOOLEAN type
DATE type
When storing DATE as TIMESTAMP…
STRING type
• Only provided for compatibility with Hive
• By default, STRING becomes VARCHAR(32K)
▪ Largest size that the database engine supports
• Avoid the use of STRING!
▪ It can cause significant performance degradation
▪ The database engine works in 32k pages
▪ Rows larger than 32k incur performance penalties and have limitations
▪ Hash join is not an option on rows where the total schema is > 32k
• Some alternatives:
▪ The best option is to use VARCHAR that matches your actual needs
▪ The bigsql.string.size property can be used to adjust the default down
▪ Property can be set server wide in bigsql-conf.xml
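For instance (sizes illustrative), prefer an explicitly sized VARCHAR over STRING:

```sql
-- STRING would default to VARCHAR(32K); a right-sized VARCHAR keeps
-- rows under the 32k page size and preserves join options
CREATE HADOOP TABLE customers (
    name VARCHAR(100),
    city VARCHAR(50)
);
```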
Checkpoint
1. What is the recommended method for getting data into your Big SQL table for best performance?
4. What does the EXTERNAL keyword do when used in a CREATE TABLE statement?
Checkpoint solutions
1. What is the recommended method for getting data into your Big SQL table for best performance?
▪ Use the LOAD command
▪ SMALLINT
▪ No. By default, STRING is mapped to VARCHAR(32K), which can lead to performance degradation. Use a VARCHAR sized to your actual needs, or change the default size.
4. What does the EXTERNAL keyword do when used in a CREATE TABLE statement?
▪ When the table is dropped, the definition is removed; the data remains unaffected.