
Using Big SQL to access data residing in the HDFS

Data Science Foundations

© Copyright IBM Corporation 2018


Course materials may not be reproduced in whole or in part without the written permission of IBM.
Unit objectives

• Get an overview of Big SQL

• Understand how Big SQL fits into the Hadoop architecture

• Start and stop Big SQL using Ambari and the command line

• Connect to Big SQL using the command line

• Connect to Big SQL using IBM Data Server Manager

Big SQL is SQL on Hadoop

• Big SQL builds on the Apache Hive foundation
  ▪ Integrates with the Hive metastore
  ▪ Uses a powerful native C/C++ MPP engine instead of MapReduce
• Provides a view on your data residing in the Hadoop file system (HDFS)
• No proprietary storage format
• Modern SQL:2011 capabilities
• The same SQL can be used on your warehouse data with little or no modification

SQL access for Hadoop: Why?

• Data warehouse modernization is a leading Hadoop use case
  ▪ Off-load "cold" warehouse data into a query-ready Hadoop platform
  ▪ Explore / transform / analyze / aggregate social media data, log records, etc., and upload summary data to the warehouse
• Skills in MapReduce, Pig, etc. are in limited supply
• SQL opens the data to a much wider audience
  ▪ Familiar, widely known syntax
  ▪ Common catalog for identifying data and structure

What does Big SQL provide?

• Comprehensive, standard SQL

• Optimization and performance

• Support for a variety of storage formats

• Integration with RDBMSs

Big SQL provides comprehensive, standard SQL

• SELECT: joins, unions, aggregates, subqueries, and more (a sketch follows this list)

• UPDATE/DELETE (HBase-managed tables)

• GRANT/REVOKE, INSERT … INTO

• SQL procedural logic (SQL PL)

• Stored procedures, user-defined functions

• IBM data server JDBC and ODBC drivers
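
As a minimal sketch of this standard-SQL coverage, the hypothetical query below combines a join, an aggregate, and a subquery; the table and column names are invented for illustration:

-- Hypothetical tables: sales(product_key, quantity), product(product_key, product_name)
SELECT p.product_name, SUM(s.quantity) AS total_qty
FROM sales s
JOIN product p ON p.product_key = s.product_key          -- join
WHERE s.quantity > (SELECT AVG(quantity) FROM sales)     -- subquery
GROUP BY p.product_name                                  -- aggregate
ORDER BY total_qty DESC;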

Big SQL provides powerful optimization and performance

• IBM MPP engine (native C++) replaces the Java MapReduce layer

• Continuously running daemons (no start-up latency)

• Message passing allows data to flow between nodes without persisting intermediate results

• In-memory operations with the ability to spill to disk (useful for aggregations and sorts that exceed available RAM)

• Cost-based query optimization with 140+ rewrite rules

Big SQL supports a variety of storage formats

• Text (delimited), Sequence, RCFile, ORC, Avro, Parquet

• Data persisted in:

  ▪ DFS

  ▪ Hive

  ▪ HBase

  ▪ WebHDFS URI* (tech preview)

• No IBM proprietary format required

Big SQL integrates with RDBMS

• The Big SQL LOAD command can load data from a remote database or table
• Query heterogeneous databases using the federation feature

Big SQL architecture

• Head (coordinator / management) node
  ▪ Listens for JDBC/ODBC connections
  ▪ Compiles, optimizes, and coordinates execution of the query
• Big SQL worker processes reside on compute nodes (some or all)
• Worker nodes stream data between each other as needed
• Workers can spill large data sets to local disk if needed

The relationship between Big SQL and Db2

Big SQL and Db2 have the same "DNA"

• Bug fixes and enhancements in Db2 (especially in the optimizer) also benefit Big SQL
• Enhancements made for Big SQL are often re-integrated into "Db2 Main"
• Features enabled for Big SQL almost for free:
  ▪ HADR for the Head Node
  ▪ Oracle PL/SQL support
  ▪ Declared Global Temporary Tables
  ▪ Time Travel Queries
  ▪ Much more…

Starting and stopping Big SQL using Ambari

Starting and stopping Big SQL from the command line

As the bigsql user, run the following commands from the active/primary head node.

View the status of all Big SQL services:

$BIGSQL_HOME/bin/bigsql status

Stop Big SQL:

$BIGSQL_HOME/bin/bigsql stop

Start Big SQL:

$BIGSQL_HOME/bin/bigsql start
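
If you are logged in as another user, a minimal sketch of the same sequence, assuming su access to the bigsql account and that BIGSQL_HOME is set in that account's environment:

# Run each command in a bigsql login shell; the single quotes defer
# expansion of $BIGSQL_HOME to that shell
su - bigsql -c '$BIGSQL_HOME/bin/bigsql status'
su - bigsql -c '$BIGSQL_HOME/bin/bigsql stop'
su - bigsql -c '$BIGSQL_HOME/bin/bigsql start'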

Accessing Big SQL

• Java SQL Shell (JSqsh)

• Web tooling using Data Server Manager (DSM)

• Tools that support the IBM JDBC/ODBC driver
JSqsh (1 of 3)

• Big SQL comes with a command-line interface, the Java SQL Shell (JSqsh, pronounced "jay-skwish")
  ▪ Open source command client
  ▪ Query history and query recall
  ▪ Multiple result set display styles
  ▪ Multiple active sessions

• Located under /usr/ibmpacks/common-utils/current/jsqsh/bin

JSqsh (2 of 3)

• Run the JSqsh connection wizard to supply connection information

• Connect to the bigsql database:

  ▪ ./jsqsh bigsql
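
Put together, a first session might look like the sketch below; the --setup flag for launching the connection wizard is taken from IBM lab material and should be treated as an assumption:

cd /usr/ibmpacks/common-utils/current/jsqsh/bin
./jsqsh --setup     # launch the connection wizard (flag assumed)
./jsqsh bigsql      # connect using the saved "bigsql" connection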

JSqsh (3 of 3)

JSqsh's default command terminator is a semicolon.

The semicolon is also a valid SQL PL statement terminator:

CREATE FUNCTION COMM_AMOUNT(SALARY DEC(9,2))
  RETURNS DEC(9,2)
  LANGUAGE SQL
BEGIN ATOMIC
  DECLARE REMAINDER DEC(9,2) DEFAULT 0.0;
  ...
END;

JSqsh applies basic heuristics to determine where a statement actually ends.

Web tooling using Data Server Manager (DSM)

Connecting to Big SQL with Data Server Manager
Create a database connection to Big SQL within DSM

Checkpoint

1. What is one of the many reasons to use Big SQL?

2. List the two ways you can access and use Big SQL.

3. What command is used to start Big SQL from the command line?

Checkpoint solutions

1. What is one of the reasons to use Big SQL?


▪ Want to access your Hadoop data without using MapReduce
▪ Do not want to learn new languages like MapReduce
▪ No deep learning curve, because Big SQL uses standard SQL:2011 syntax
2. List the two ways you can work with Big SQL.
▪ JSqsh
▪ Web tooling from DSM
3. What command is used to start Big SQL from the command line?
▪ $BIGSQL_HOME/bin/bigsql start

Creating Big SQL schemas and tables

Data Science Foundations

© Copyright IBM Corporation 2018


Course materials may not be reproduced in whole or in part without the written permission of IBM.
Unit objectives

• Describe and create Big SQL schemas and tables

• Describe and list the Big SQL data types

• Work with various Big SQL DDLs

• Load data into Big SQL tables using best practices

Big SQL terminology

• Warehouse
  ▪ Default directory in the HDFS where the tables are stored
  ▪ Defaults to /apps/hive/warehouse/
• Schema
  ▪ Tables are organized into schemas
  ▪ Example: the bigsql schema defaults to /apps/hive/warehouse/bigsql.db
• Table
  ▪ A directory with zero or more data files
  ▪ Example: /apps/hive/warehouse/bigsql.db/test1
  ▪ Tables may be stored anywhere
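
Because a table is just a directory of files, you can inspect it directly with standard HDFS commands; a minimal sketch, assuming the example table above exists:

# List the data files backing the bigsql.test1 table
hdfs dfs -ls /apps/hive/warehouse/bigsql.db/test1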

Partitioned tables

• A table may be partitioned on one or more columns

• The partitioning columns are specified when the tables are created

• Data for each partition value is stored in its own directory

• Query predicates can be used to eliminate the need to scan every partition

▪ Only scan what is needed.

• Example:

▪ /apps/hive/warehouse/schema.db/tablename/col1=val1

▪ /apps/hive/warehouse/schema.db/tablename/col1=val2

Creating Big SQL schemas

• Big SQL tables are organized into schemas
• A default schema is created that matches your login name
• The USE command can be used to set the default schema (and to create it if it doesn't exist!)
• The CREATE SCHEMA command can be used to create a schema
• You can think of a schema as a database!

[myhost][bigsql] 1> use "newschema";
[myhost][bigsql] 1> create hadoop table t1 (c1 int);
[myhost][bigsql] 1> insert into t1 values (10);
[myhost][bigsql] 1> select * from t1;
+------+
| C1   |
+------+
|   10 |
+------+

Using the web GUI to browse the HDFS

• Access via Ambari

Creating a Big SQL table
• Standard CREATE TABLE DDL with extensions

create hadoop table users
(
  id        int not null primary key,
  office_id int null,
  fname     varchar(30) not null,
  lname     varchar(30) not null
)
row format delimited
fields terminated by '|'
stored as textfile;

• The "hadoop" keyword creates the table in the HDFS
• Row format delimited and textfile formats are the defaults
• Constraints are not enforced (but are useful for query optimization)
Results from previous CREATE TABLE . . .

• Data stored in a subdirectory of the Hive warehouse
  /apps/hive/warehouse/myid.db/users
  ▪ Default schema is your user ID. You can create new schemas
  ▪ A "table" is just a subdirectory under schema.db
  ▪ A table's data are files within the table subdirectory
• Metadata collected (Big SQL & Hive)
  ▪ SYSCAT.* and SYSHADOOP.* views
• Optionally, use the LOCATION clause of CREATE TABLE to layer the Big SQL schema over existing DFS directory contents
  ▪ Useful if table contents are already in DFS
  ▪ Avoids the need to LOAD data into Hive
More about CREATE TABLE

• HADOOP keyword
  ▪ Must be specified unless you enable SYSHADOOP.COMPATIBILITY_MODE
• EXTERNAL keyword
  ▪ Indicates that the table is not managed by the database manager
  ▪ When the table is dropped, the definition is removed; the data remains unaffected
• LOCATION keyword
  ▪ Specifies the DFS directory to store the data files

CREATE EXTERNAL HADOOP TABLE T1
(
  C1 INT NOT NULL PRIMARY KEY CHECK (C1 > 0),
  C2 VARCHAR(10) NULL,
  …
)
…
LOCATION '/user/myusername/tables/user'
CREATE TABLE - partitioned tables

• Similar to Hive, Big SQL also has partitioned tables (see the sketch below)
• Partitioned on one or more columns
• Query predicates are used to eliminate unwanted partitions, speeding up the query
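
A minimal sketch of a partitioned table, using invented table and column names (as in Hive, the PARTITIONED BY clause places the partitioning column outside the main column list):

create hadoop table sales_part
( id     int,
  amount decimal(9,2) )
partitioned by (sale_date varchar(10))    -- one directory per sale_date value
stored as textfile;

-- A predicate on the partitioning column lets Big SQL scan only the
-- matching directory, e.g. .../sales_part/sale_date=2018-01-01
select sum(amount) from sales_part where sale_date = '2018-01-01';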

Additional CREATE TABLE features

• Constraints can be defined inline in the table definition

• A variety of file formats is available for STORED AS
  ▪ TEXT
  ▪ SEQUENCEFILE
  ▪ ORC
  ▪ PARQUETFILE
  ▪ More!
• NULL DEFINED AS clause for ROW FORMAT DELIMITED
  ▪ Explicit syntax for defining a NULL value in a delimited file
• Support for CREATE TABLE LIKE to clone another table (see the sketch below)
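
A hedged sketch of the last two features, with invented table and column names:

-- Treat the literal \N in the data files as SQL NULL
create hadoop table t_nulls (c1 int, c2 varchar(20))
row format delimited
fields terminated by ','
null defined as '\N'
stored as textfile;

-- Clone the definition of an existing table
create hadoop table t_clone like t_nulls;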

CREATE VIEW

• CREATE VIEW statement defines a view on one or more tables, views or nicknames
• Standard SQL syntax

create view my_users as
select fname, lname from bigsql.users where id > 100;

Loading data into Big SQL tables

• Populating tables via LOAD
  ▪ Best runtime performance
• Populating tables via INSERT
  ▪ INSERT INTO … SELECT FROM …
    − Parallel read and write operations
  ▪ INSERT INTO … VALUES (…)
    − NOT parallelized; one file per insert. Not recommended, except for quick tests.
• Populating tables via CREATE … TABLE … AS SELECT …
  ▪ Create a Big SQL table based on the contents of other table(s)

Populating Big SQL tables via LOAD

• Typically the best runtime performance

• Load data from a local or remote file system
load hadoop using file url
'ftp://myID:[email protected]:22/installdir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE gosalesdw.GO_REGION_DIM overwrite;

• Load data from an RDBMS (Db2, Netezza, Teradata, Oracle, MS-SQL, Informix) via a JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);

Populating Big SQL tables via INSERT (1 of 2)

• INSERT INTO … SELECT FROM …

CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key INT NOT NULL, product_name VARCHAR(150),
  quantity INT, order_method_en VARCHAR(90) )
STORED AS parquetfile;

-- source tables do not need to be in Parquet format
insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity,
       meth.order_method_en
FROM sls_sales_fact sales, sls_product_dim prod, sls_product_lookup pnumb,
     sls_order_method_dim meth
WHERE pnumb.product_language = 'EN'
  AND sales.product_key = prod.product_key
  AND prod.product_number = pnumb.product_number
  AND meth.order_method_key = sales.order_method_key
  AND sales.quantity > 5500;
Populating Big SQL tables via INSERT (2 of 2)

• INSERT INTO … VALUES (…)
  ▪ NOT parallelized
  ▪ One file is created per insert statement
  ▪ Not recommended, except for testing

create table foo (col1 int, col2 varchar(10));
insert into foo values (1, 'hello');

Populating Big SQL tables via CREATE … TABLE … AS SELECT …

• Source tables can be in different file formats or use a different underlying storage mechanism

-- source tables in this example are external (just DFS files)
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key       INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key  INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en   VARCHAR(90)
, product_line_de   VARCHAR(90)
)
as select product_key, d.product_line_code, product_type_key,
          product_type_code, product_line_en, product_line_de
   from extern.sls_product_dim d, extern.sls_product_line_lookup l
   where d.product_line_code = l.product_line_code;

Data types

• Big SQL uses HCatalog (Hive Metastore) as its underlying data representation and access method

• SQL type

▪ This is the data type that the database engine supports

• Hive type

▪ This data type is defined in the Hive metastore for the table

▪ This type tells the SerDe how to encode/decode values for the type

▪ The Big SQL reader converts values in the Hive types to SQL values on read

More about . . . data types
• A variety of primitives supported
  ▪ TINYINT, INT, DECIMAL(p,s), FLOAT, REAL, CHAR, VARCHAR, TIMESTAMP, DATE, VARBINARY, BINARY, . . .
  ▪ Maximum length: 32K
• Complex types
  ▪ ARRAY: ordered collection of elements of the same type
  ▪ Associative ARRAY (equivalent to the Hive MAP type): unordered collection of key/value pairs. Keys must be primitive types (consistent with Hive)
  ▪ ROW (equivalent to the Hive STRUCT type): collection of elements of different types
  ▪ Nesting supported for array-of-rows and map-of-rows types
  ▪ Query predicates for ARRAY or ROW columns must specify elements of a primitive type

CREATE HADOOP TABLE mytable (id INT, info INT ARRAY[10]);
SELECT * FROM mytable WHERE info[8]=12;
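
The example above covers an ordinary ARRAY; the sketch below shows a comparable associative ARRAY. The ARRAY[VARCHAR(10)] index notation follows the Db2-style associative-array declaration and should be treated as an assumption here:

-- Hypothetical associative array keyed by a short string
CREATE HADOOP TABLE phones (id INT, phone VARCHAR(20) ARRAY[VARCHAR(10)]);
SELECT phone['home'] FROM phones WHERE id = 1;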
Data type mapping

BOOLEAN type

• The BOOLEAN type is defined as a SMALLINT SQL type in Big SQL

• In queries, BOOLEAN must be treated as a SMALLINT
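
A minimal sketch, using an invented table, of what this means in practice:

-- BOOLEAN surfaces as SMALLINT, so compare against 0/1 rather than TRUE/FALSE
CREATE HADOOP TABLE flags (id INT, active BOOLEAN);
SELECT id FROM flags WHERE active = 1;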

DATE type

DATE can be stored in two ways:

• DATE STORED AS TIMESTAMP
  ▪ The DATE data type is mapped and stored as a Hive TIMESTAMP data type
  ▪ The default

• DATE STORED AS DATE
  ▪ The DATE data type is mapped and stored as a Hive DATE data type
  ▪ Potential performance impact, because Java readers rather than native readers are used to access the DATE data type

When storing DATE as TIMESTAMP…

• The DATE type is defined in Hive as a TIMESTAMP
  ▪ This means data files with DATE values must be defined with a full time portion

• During all implicit conversions to DATE, the time portion of the value is discarded
REAL and FLOAT types

• REAL is a 32-bit IEEE floating point

• FLOAT is a synonym for DOUBLE (64-bit IEEE floating point)
• Hive FLOAT → Big SQL REAL

STRING type
• Only provided for compatibility with Hive
• By default, STRING becomes VARCHAR(32K)
  ▪ The largest size that the database engine supports
• Avoid the use of STRING!
  ▪ It can cause significant performance degradation
  ▪ The database engine works in 32K pages
  ▪ Rows larger than 32K incur performance penalties and have limitations
  ▪ Hash join is not an option on rows where the total schema is > 32K
• Some alternatives:
  ▪ The best option is to use a VARCHAR that matches your actual needs
  ▪ The bigsql.string.size property can be used to adjust the default down
  ▪ The property can be set server-wide in bigsql-conf.xml (see the sketch below)

[localhost][bigsql] 1> set hadoop property bigsql.string.size=16;
[localhost][bigsql] 1> create hadoop table t1 (fname string, lname string);
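
For the server-wide setting, a hedged sketch of a bigsql-conf.xml entry, assuming the file uses the standard Hadoop-style property format; the value shown is illustrative:

<!-- Assumed Hadoop-style entry; 255 is an illustrative default -->
<property>
  <name>bigsql.string.size</name>
  <value>255</value>
</property>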

Checkpoint

1. What is the recommended method for getting data into your Big SQL table for best performance?

2. The BOOLEAN type is defined as what SQL type in Big SQL?

3. Should you use the default STRING data type?

4. What does the EXTERNAL keyword do when used in a CREATE TABLE statement?

Checkpoint solutions

1. What is the recommended method for getting data into your Big SQL table for best performance?

▪ Using the LOAD operation

2. The BOOLEAN type is defined as what SQL type in Big SQL?

▪ SMALLINT

3. Should you use the default STRING data type?

▪ No. By default, STRING is mapped to VARCHAR(32K), which can lead to performance degradation. Use a VARCHAR sized to your actual needs, or change the default size.

4. What does the EXTERNAL keyword do when used in a CREATE TABLE statement?

▪ Indicates that the table is not managed by the database manager

▪ When the table is dropped, the definition is removed; the data remains unaffected.
