Course 3 GP Advanced SQL and Tuning
&
Performance Tuning
Course Introduction
Greenplum Confidential
Objectives
Review Greenplum Shared Nothing Architecture & Concepts
By implementing a Greenplum Data Mart, students will:
Create a Greenplum Database and Schemas
Evaluate Logical and Physical Data Models
Determine appropriate Distribution Keys and create tables
Determine and implement partitioning and indexing strategies
Determine and implement optimal load methodologies
By
Prerequisites
Greenplum Fundamentals Training Course
Experience with Unix or Linux
Experience with SQL (PostgreSQL is nice!)
Experience with vi or Emacs
Experience with Shell Scripting
Module 1
Greenplum Architecture
MPP = Massively Parallel Processing
Multiple units of parallelism working on the same task
Parallel Database Operations
Parallel CPU Processing
Greenplum Units of Parallelism are Segments
Greenplum Architecture
Interconnect
Module 1
LAB
Greenplum Architecture Review
Module 2
Data Modeling,
Databases & Schemas
Data Models
Logical Data Model
Graphical representation of the business
requirements
Contains the things of importance in an organization
and how they relate to one another
Contains business textual definitions and examples
Attribute
Defines a property of the entity
Relationship
How entities relate to each other
Primary Key
A single or combination of attributes that uniquely
identifies the entity.
Fact
Measurable
Relates to a specific
instance
May have compound
Primary Key
Snowflake Schema
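Attribute    | Source | Uniqueness | Data Type     | Data Example
-------------+--------+------------+---------------+-----------------
Store ID     | Stores | Unique     | Small Integer | 1, 2 ... N
Store Name   | Stores | Non-unique | Character     | Tom's Groceries
Has Pharmacy | Stores | Non-unique | Character     | T, F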
Columns
How the attributes of the entity are physically
represented
Constraints
Fact
Transaction
Notes:
Entities become tables
Attributes become
columns
Relationships may become
foreign key constraints
Data Segregation
Database
May have multiple in a Greenplum system
Greenplum databases do not share data
Schema
A physical grouping of database objects
Views
All or some table columns may be seen
Allows for user defined columns
- Instantiated at run time
About Schemas
Schemas are a way to logically organize objects within a
database (tables, views, functions, etc)
Use fully qualified names to access objects in a schema
EXAMPLE: myschema.mytable
Every database has a public schema
Other system level schemas include: pg_catalog,
information_schema, pg_toast, pg_bitmapindex
DROP SCHEMA
Default Schemas
public schema
No objects at database creation
Generally used to store public objects
pg_catalog
This is the data dictionary
information_schema
This schema contains views and tables to make the
data dictionary easier to understand with fewer joins.
pg_bitmapindex
Used to store bitmap index objects
TOAST Schema
This schema is used specifically to store large
attribute objects (> 1 page of data)
The Oversized Attribute Storage Technique
Module 2
LAB
Data Modeling,
Databases & Schemas
Module 3
Physical Design Decisions
Constraints
Table
Column
Key Comparison
Primary Key
Distribution By Key
Limits:
Must be unique
Can be non-unique
Can be NULL
Data Distribution
CREATE TABLE tablename (
  column_name1 data_type NOT NULL,
  column_name2 data_type NOT NULL,
  column_name3 data_type NOT NULL DEFAULT default_value,
  ...)
[DISTRIBUTED BY (column_name)]    hash algorithm
OR
[DISTRIBUTED RANDOMLY]            round-robin
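The two distribution policies can be sketched in a few lines of Python. This is only an illustration of the idea: the segment count, the sample keys, and the use of zlib.crc32 as the hash are assumptions, and none of it is Greenplum's actual implementation.

```python
import zlib
from collections import Counter
from itertools import cycle

NUM_SEGMENTS = 4  # assumed segment count, chosen only for illustration


def hash_segment(key, num_segments=NUM_SEGMENTS):
    """DISTRIBUTED BY stand-in: equal keys always land on the same segment.

    zlib.crc32 is an assumption used for a repeatable demo; it is not
    Greenplum's actual hash function.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def round_robin_segments(row_count, num_segments=NUM_SEGMENTS):
    """DISTRIBUTED RANDOMLY stand-in: rows are dealt to segments in turn."""
    dealer = cycle(range(num_segments))
    return [next(dealer) for _ in range(row_count)]


customer_ids = range(1, 1001)
by_hash = Counter(hash_segment(cid) for cid in customer_ids)
by_round_robin = Counter(round_robin_segments(len(customer_ids)))
by_gender = Counter(hash_segment(g) for g in ["M", "F"] * 500)

print("hash on customer_id:", sorted(by_hash.items()))
print("round-robin:        ", sorted(by_round_robin.items()))
print("hash on gender:     ", sorted(by_gender.items()))
```

A two-value key such as gender can occupy at most two segments no matter how many segments exist, which is why a high-cardinality column usually makes the better distribution key.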
[Figure: Network, Disk I/O, CPU]
Response time is the completion time for ALL segment instances in a Greenplum database
[Figure: axis 0 to 5; Network, Disk I/O, CPU]
[Figure: axis 0 to 5, Gender = M or F; months JAN through JUN; Network, Disk I/O, CPU]
[Figure: on each segment host and segment instance, customer (c_customer_id) = freq_shopper (f_customer_id)]
[Figure: customer (c_customer_id) and freq_shopper (f_trans_number) across segment hosts and segment instances; rows shown for customer_id = 102 and customer_id = 745]
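The two diagrams above contrast freq_shopper distributed on the join column with freq_shopper distributed on f_trans_number. A rough Python sketch of the difference follows; the modulo hash, the segment count, and the transaction numbers are assumptions made up for the illustration, while customer_id 102 and 745 come from the slide.

```python
NUM_SEGMENTS = 4  # assumed segment count


def segment_for(key):
    """Toy hash: a plain modulo stands in for the real distribution hash."""
    return key % NUM_SEGMENTS


# (f_trans_number, f_customer_id); the transaction numbers are invented.
freq_shopper = [(9001, 102), (9002, 745), (9003, 102)]


def rows_to_move(rows, distribution_column):
    """Count rows stored on a different segment than their matching customer row.

    customer is assumed to be distributed by c_customer_id, so the matching
    customer row lives on segment_for(f_customer_id).
    """
    moved = 0
    for row in rows:
        stored_on = segment_for(row[distribution_column])
        needed_on = segment_for(row[1])
        if stored_on != needed_on:
            moved += 1
    return moved


moved_when_colocated = rows_to_move(freq_shopper, distribution_column=1)
moved_when_by_trans = rows_to_move(freq_shopper, distribution_column=0)

print("distributed by f_customer_id, rows to move: ", moved_when_colocated)
print("distributed by f_trans_number, rows to move:", moved_when_by_trans)
```

When both tables share the join column as distribution key, every join can be resolved locally on each segment; otherwise rows have to travel over the interconnect first.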
[Figure: customer (c_customer_id) on each segment instance, with state (s_statekey) holding AK, AL, AZ, CA on every segment instance]
DISTRIBUTED RANDOMLY
Uses a round robin algorithm
Any query that joins to a table that is
gpskew Example
gpadmin@mdw > gpskew -t public.customer
...
. . . Number of Distribution Column(s) = {1}
. . . Distribution Column Name(s) = cust_id
20080316:20:12:08:gpskew:mdw:gpadmin-[INFO]
. . . Total records (inc child tables) = 5277811399
...
20080316:20:12:08:gpskew:mdw:gpadmin-[INFO]:-Skew result
. . . Maximum Record Count = 66004567
. . . Segment instance hostname = gpdbhost
. . . Segment instance name = dw100
. . . Minimum Record Count = 65942375
. . . Segment instance hostname = gpdbhost
. . . Segment instance name = dw100
. . . Record count variance = 62192
Businesses have recognized the analytic value of detailed transactions and are
storing larger and larger volumes of this data
Partitions in Greenplum
A mechanism in Greenplum for use in physical database design
Increases the available options to improve the performance of a
certain class of queries
Only the rows of the qualified partitions (child table) in a query need to be
accessed
Trade date
Order date
Billing date
End of Month date
[Figure: customer (c_customer_id) (week_number) on Segment Instance 1 through Segment Instance 51, each holding week_number = 200701, 200702, 200703 ... 200752]
Distributed by c_customer_id
Partitioned by week_number
Partition Elimination
SELECT . . . FROM customer
WHERE week_number = 200702
[Figure: Segment Instance 0, customer (c_customer_id) (week_number), holding week_number = 200701 through 200752]
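Partition elimination for the query above can be pictured as a filter over the list of child tables before any rows are read. The sketch below is a simplified model, not the planner's logic, and the child-table names are hypothetical.

```python
WEEKS_2007 = [200700 + n for n in range(1, 53)]  # 200701 .. 200752

# Hypothetical child-table names; the course material does not name them.
CHILDREN = {week: f"customer_{week}" for week in WEEKS_2007}


def children_to_scan(predicate):
    """Return only the child tables whose week_number can satisfy the predicate."""
    return [name for week, name in CHILDREN.items() if predicate(week)]


one_week = children_to_scan(lambda week: week == 200702)
four_weeks = children_to_scan(lambda week: 200701 <= week <= 200704)

print(f"{len(one_week)} of {len(CHILDREN)} children scanned:", one_week)
print(f"{len(four_weeks)} of {len(CHILDREN)} children scanned")
```

With WHERE week_number = 200702, only one of the 52 weekly children needs to be read on each segment instance.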
[Figure: customer (c_customer_id) (week_number) on Segment Instance 1 through Segment Instance 51, each holding week_number = 200702, 200703 ... 200752, 200801]
customer parent table
Parent table contains no data
By default data is not automatically inserted into the correct child tables
- Use rewrite rules
OR
- Use external tables to filter the data
[Figure: customer parent with children week_number = 200701, 200702, 200703, 200704 ... 200752]
Greenplum Release
Multi-level Partition example
CREATE TABLE orders
( order_id      NUMBER(12),
  order_date    TIMESTAMP WITH LOCAL TIME ZONE,
  order_mode    VARCHAR2(8),
  customer_id   NUMBER(6),
  order_status  NUMBER(2),
  order_total   NUMBER(8,2),
  sales_rep_id  NUMBER(6),
  promotion_id  NUMBER(6)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
SUBPARTITION BY HASH (customer_id)
SUBPARTITIONS 8
(
  partition minny,
  starting '2004-12-01' ending '2006-12-01',
  partition maxy
);
Table Constraints
Primary Key Constraint *
EXAMPLE: CONSTRAINT mytable_pk PRIMARY KEY (mycolumn)
With OIDs
EXAMPLE: WITH (OIDS=TRUE)
This is used when updating or deleting rows via an ODBC
connection from another database. (Oracle, SQL Server, etc.)
* If a table has a primary key, this column (or group of columns) is chosen
as the distribution key for the table. If a table has a primary key, then it
cannot have other columns with a unique constraint.
Column Constraints
Check Constraints
EXAMPLE:
Uniqueness Constraint *
EXAMPLE:
Module 3
LAB
Physical Design Decisions
Module 4
Data Loading
ETL versus ELT
File Based
Web Based
gpfdist Massively Parallel Load Tool
gpload
File Formats
text
csv
Web Tables
- URL-based (http://)
- Output-based (OS commands, scripts)
gpload
Runs a load job as defined in a YAML formatted control file
gpload is a data loading utility that acts as an interface to the Greenplum
external table parallel loading feature. Using a load specification defined in a
YAML formatted control file, gpload executes a load by invoking the
Greenplum parallel file server (gpfdist), creating an external table
definition based on the source data defined, and executing an
INSERT, UPDATE or MERGE operation to load the source data into
the target table in the database.
Usage:
gpload -f control_file [-l log_file] [-h hostname] [-p port] [-U username]
[-d database] [-W] [-v | -V] [-q] [-D]
gpload -? | --version
Example:
gpload -f my_load.yml
About Sequences
Greenplum Master sequence generator process (seqserver)
Sequence SQL Commands:
CREATE SEQUENCE
ALTER SEQUENCE
DROP SEQUENCE
Sequence Function Limitations:
lastval and currval not supported
setval cannot be used in queries that update data
nextval not allowed in UPDATEs or DELETEs if mirrors are enabled
nextval may grab a block of values for some queries
PSQL Tips:
To list all sequences while in psql: \ds
To see a sequence definition: \d+ sequence_name
Sequence Example
Create a sequence named myseq
CREATE SEQUENCE myseq START 101;
Insert a row into a table that gets the next value:
INSERT INTO distributors
VALUES (nextval('myseq'), 'acme');
Reset the sequence counter value on the master:
SELECT setval('myseq', 201);
Not allowed in Greenplum DB (set sequence value on segments):
INSERT INTO product
VALUES (setval('myseq', 201), 'gizmo');
Using SQL
-- Create a temporary table with the current max CustomerID
CREATE TEMPORARY TABLE MAXI AS (
SELECT COALESCE(MAX(CustomerID),0) AS MaxID
FROM dimensions.Customer
)
DISTRIBUTED BY (MaxID)
;
SELECT (RANK() OVER (ORDER BY phone)) + MAXI.MaxID,
custName,
address,
city,
state,
zipcode,
zipPlusFour,
countrycd,
phone
FROM public.customer_external
CROSS JOIN MAXI
;
Join the single row table with the max value to every row in the
source table, add the rank and voila, no gaps! This holds only while
phone is unique in the source: RANK() gives tied rows the same value
and skips the values that follow.
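The max-plus-rank technique can be traced with a small Python model. It is a sketch only: the function name and the phone numbers are invented for the illustration, and the ranking loop mimics SQL RANK() semantics, where tied values share the lowest position.

```python
def assign_customer_ids(existing_ids, phones):
    """Model of: RANK() OVER (ORDER BY phone) + COALESCE(MAX(CustomerID), 0)."""
    max_id = max(existing_ids, default=0)  # COALESCE(MAX(CustomerID), 0)
    ordered = sorted(phones)               # ORDER BY phone
    first_position = {}
    for position, phone in enumerate(ordered, start=1):
        # RANK(): every tied value keeps the position of its first occurrence.
        first_position.setdefault(phone, position)
    return [(max_id + first_position[phone], phone) for phone in ordered]


unique_phones = assign_customer_ids([1, 2, 3], ["555-0003", "555-0001", "555-0002"])
tied_phones = assign_customer_ids([1, 2, 3], ["555-0001", "555-0001", "555-0002"])

print(unique_phones)  # consecutive ids continuing after the current maximum
print(tied_phones)    # a repeated phone yields a repeated id and a gap
```

If the ordering column can repeat, a row_number()-style numbering avoids both the duplicate ids and the gaps.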
Module 4
LAB
Data Loading
ETL versus ELT
Module 5
Explain the Explain Plan
Analyzing Queries
EXPLAIN Output
Query plans are a right to left plan tree of nodes
read from the bottom up
Each node feeds its results to the node directly
above
There is one line for each node in the plan tree
Each node represents a single operation
Sequential Scan, Hash Join, Hash Aggregation, etc
EXPLAIN Example
Gather Motion 48:1 (slice1) (cost=14879102.69..14970283.82 rows=607875 width=548)
Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Unique (cost=14879102.69..14970283.82 rows=607875 width=548)
Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Sort (cost=14879102.69..14894299.54 rows=6078742 width=548)
Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Result (cost=0.00..3188311.04 rows=6078742 width=548)
-> Append (cost=0.00..3188311.04 rows=6078742 width=548)
-> Seq Scan on display_run b (cost=0.00..1.02 rows=1 width=548)
Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone
AND local_time < '2007-04-01 00:00:00'::timestamp without time zone AND (url::text
'delete::text OR url::text 'estimate time'::text OR url::text 'user time::text)
-> Seq Scan on display_run_child_2007_03_month b
(cost=0.00..3188310.02 rows=6078741 width=50)
Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone
AND local_time < '2007-04-01 00:00:00'::timestamp without time zone AND (url::text
'delete::text OR url::text 'estimate time'::text OR url::text 'user time::text)
Partition Elimination
Gather Motion 48:1 (slice1) (cost=14879102.69..14970283.82 rows=607875 width=548)
Merge Key: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Unique (cost=14879102.69..14970283.82 rows=607875 width=548)
Group By: b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Sort (cost=14879102.69..14894299.54 rows=6078742 width=548)
Sort Key (Distinct): b.run_id, b.pack_id, b.local_time, b.session_id, b."domain"
-> Result (cost=0.00..3188311.04 rows=6078742 width=548)
-> Append (cost=0.00..3188311.04 rows=6078742 width=548)
-> Seq Scan on display_run b (cost=0.00..1.02 rows=1 width=548)
Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone AND local_time < '2007-04-01
00:00:00'::timestamp without time zone AND (url::text 'delete::text OR url::text 'estimate time'::text OR url::text 'user
time::text)
-> Seq Scan on display_run_child_2007_03_month b
(cost=0.00..3188310.02 rows=6078741 width=50)
Filter: local_time >= '2007-03-01 00:00:00'::timestamp without time zone AND local_time < '2007-04-01
00:00:00'::timestamp without time zone AND (url::text 'delete::text OR url::text 'estimate time'::text OR url::text 'user
time::text)
Sequential Scan
Hash Join
Hash Agg
Redistribute Motion
Slow Operators
Nested Loop Join
Merge Join
Sort
Broadcast Motion
Small table broadcast is acceptable
Increase work_mem
-> HashAggregate (cost=74852.40..84739.94 rows=791003 width=45)
Group By: l_orderkey, l_partkey, l_comment
Rows out: 2999671 rows (seg1) with 13345 ms to first row, 71558 ms to end, start offset by 3.533 ms.
Executor memory: 2645K bytes avg, 5019K bytes max (seg1).
Work_mem used: 2321K bytes avg, 4062K bytes max (seg1).
Work_mem wanted: 237859K bytes avg, 237859K bytes max (seg1) to lessen workfile I/O affecting 1 workers.
...
-> Seq Scan on lineitem (cost=0.00..44855.70 rows=2999670 width=45)
Rows out: 2999671 rows (seg1) with 0.571 ms to first row, 4167 ms to end, start offset by 4.105
Slice statistics:
(slice0) Executor memory: 211K bytes.
(slice1) * Executor memory: 2840K bytes avg x 2 workers, 5209K bytes max (seg1).
Work_mem: 4062K bytes max, 237859K bytes wanted.
Settings: work_mem=4MB
Total runtime: 73326.082 ms
(24 rows)
Spill Files
Operations performed in memory are optimal
Hash Join, Hash Agg, Sort, etc.
join_collapse_limit = 1
Module 5
LAB
Explain the Explain Plan
Analyzing Queries
Module 6
Improve Performance
With Statistics
Rows
The number of rows output by the plan node
Width
Total bytes of all the rows output by the plan node
default_statistics_target
Use default_statistics_target server configuration
gp_analyze_relative_error
The gp_analyze_relative_error server
configuration parameter sets the estimated
acceptable error in the cardinality of the table
A value of 0.5 is equivalent to an acceptable
error of 50%
The default value is 0.5
Decreasing the relative error fraction (accepting
less error) will increase the number of rows
sampled
gp_analyze_relative_error = .25
Module 6
LAB
Improve Performance
With Statistics
Module 7
Indexing Strategies
Indexes
Bitmap
Use for low cardinality columns
Use when column is often included in predicate
Hash
Available, but not recommended
B-Tree Index
Supports single value row lookups
Can be Unique or Non-Unique
Can be Single or Multi-Column
If multi-column, the index is most efficient when the predicate
constrains the leading (leftmost) index columns.
EXAMPLE:
CREATE INDEX transid_btridx
ON facts.transaction
USING BTREE (transactionid)
;
Bitmap Index
Single column index is recommended
Provides very fast retrieval
Low cardinality columns
Gender
State / Province Code
EXAMPLE:
CREATE INDEX store_pharm_bmidx ON dimensions.store
USING BITMAP (pharmacy);
CREATE INDEX store_grocery_bmidx ON dimensions.store
USING BITMAP (grocery);
CREATE INDEX store_deli_bmidx ON dimensions.store
USING BITMAP (deli);
Index on Expressions
Only use when the expression appears often in
query predicates.
Very high overhead maintaining the index
during insert and update operations.
EXAMPLE:
CREATE INDEX lcase_storename_idx
ON store (LOWER(storename));
SUPPORTS:
SELECT * FROM store WHERE LOWER(storename) = 'top foods';
Index Costs
Indexes are not free!
Indexes take space
Indexes incur overhead during inserts and
updates
Indexes incur processing overhead during
creation
Module 7
LAB
Indexing Strategies
Module 8
Advanced Reporting
Using
OLAP
What is OLAP?
On-Line Analytic Processing
Quickly provide answers to multi-dimensional
queries
Window Functions
Allows access to multiple rows in a single pass
 pn  | vn |   sum
-----+----+---------
 100 | 20 |       0
 100 | 40 | 2640000
 200 | 10 |       0
 200 | 40 |       0
 300 | 30 |       0
 400 | 50 |       0
 500 | 30 |     120
 600 | 30 |      60
 700 | 40 |       1
 800 | 40 |       1
(10 rows)
ROLLUP Example
the same query using ROLLUP
 pn  | vn |   sum
-----+----+---------
 100 | 20 |       0
 100 | 40 | 2640000
 100 |    | 2640000
 200 | 10 |       0
 200 | 40 |       0
 200 |    |       0
 300 | 30 |       0
 300 |    |       0
 400 | 50 |       0
 400 |    |       0
 500 | 30 |     120
 500 |    |     120
 600 | 30 |      60
 600 |    |      60
CUBE Example
is the same as
 pn  | vn |   sum
-----+----+---------
 100 | 20 |       0
 100 | 40 | 2640000
 100 |    | 2640000
 200 | 10 |       0
 200 | 40 |       0
 200 |    |       0
 300 | 30 |       0
 300 |    |       0
 400 | 50 |       0
 400 |    |       0
 500 | 30 |     120
 500 |    |     120
 600 | 30 |      60
 600 |    |      60
 700 | 40 |       1
 700 |    |       1
 800 | 40 |       1
 800 |    |       1
 store | customer | product | price
-------+----------+---------+-------
 s2    | c1       | p1      |    90
 s2    | c1       | p2      |    50
 s2    |          | p1      |    44
 s1    | c2       | p2      |    70
 s1    | c3       | p1      |    40
(5 rows)
grouping(customer)
FROM dsales_null
GROUP BY ROLLUP(store,customer,product);
 store | customer | product | sum | grouping
-------+----------+---------+-----+----------
 s1    | c2       | p2      |  70 |        0
 s1    | c2       |         |  70 |        0
 s1    | c3       | p1      |  40 |        0
 s1    | c3       |         |  40 |        0
 s1    |          |         | 110 |        1
 s2    | c1       | p1      |  90 |        0
 s2    | c1       | p2      |  50 |        0
 s2    | c1       |         | 140 |        0
 s2    |          | p1      |  44 |        0
GROUP_ID Function
Returns 0 for each output row in a unique grouping set
Assigns a serial number >0 to each duplicate grouping set
found
Useful when combining grouping extension clauses
Can be used to filter output rows of duplicate grouping sets:
SELECT a, b, c, sum(p*q), group_id()
FROM sales
GROUP BY ROLLUP(a,b), CUBE(b,c)
HAVING group_id()<1
ORDER BY a,b,c;
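Why GROUP BY ROLLUP(a,b), CUBE(b,c) produces duplicate grouping sets can be worked out by expanding both clauses and combining them. The Python sketch below assumes the standard rule that comma-separated grouping clauses combine as a cross product; the helper names are invented for the illustration and the numbering only imitates what group_id() reports.

```python
from collections import Counter
from itertools import combinations, product


def rollup(*cols):
    """ROLLUP(a, b) -> (a, b), (a), ()"""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def cube(*cols):
    """CUBE(b, c) -> (b, c), (b), (c), ()"""
    return [combo for n in range(len(cols), -1, -1)
            for combo in combinations(cols, n)]


def combine(*clauses):
    """Cross product of clauses; each result is the union of one set per clause."""
    combined = []
    for parts in product(*clauses):
        cols = []
        for part in parts:
            for col in part:
                if col not in cols:
                    cols.append(col)
        combined.append(tuple(cols))
    return combined


def with_group_id(grouping_sets):
    """Number each grouping set: 0 for its first occurrence, then 1, 2, ..."""
    seen = Counter()
    numbered = []
    for grouping_set in grouping_sets:
        key = frozenset(grouping_set)
        numbered.append((grouping_set, seen[key]))
        seen[key] += 1
    return numbered


sets = combine(rollup("a", "b"), cube("b", "c"))
numbered = with_group_id(sets)
kept = [grouping_set for grouping_set, gid in numbered if gid < 1]

for grouping_set, gid in numbered:
    print(gid, grouping_set)
```

Filtering on group_id() < 1, as in the HAVING clause above, keeps one copy of each distinct grouping set.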
SELECT * ,
row_number()
OVER(PARTITION BY cn)
FROM sale
ORDER BY cn;
 row_number | cn | vn | pn  | dt         | qty  | prc
------------+----+----+-----+------------+------+------
          1 |  1 | 10 | 200 | 1401-03-01 |    1 |    0
          2 |  1 | 30 | 300 | 1401-05-02 |    1 |    0
          3 |  1 | 50 | 400 | 1401-06-01 |    1 |    0
          4 |  1 | 30 | 500 | 1401-06-01 |   12 |    5
          5 |  1 | 20 | 100 | 1401-05-01 |    1 |    0
          6 |  2 | 50 | 400 | 1401-06-01 |    1 |    0
          7 |  2 | 40 | 100 | 1401-01-01 | 1100 | 2400
          8 |  3 | 40 | 200 | 1401-04-01 |    1 |    0
(8 rows)
 vn |   sum
----+---------
 40 | 2640002
 30 |     180
 50 |       0
 20 |       0
 10 |       0
(5 rows)
 vn |   sum   | rank
----+---------+------
 40 | 2640002 |    1
 30 |     180 |    2
 50 |       0 |    3
 20 |       0 |    3
 10 |       0 |    3
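The rank column above follows the usual RANK() rule: tied values share a rank, and the next distinct value would take its row position. A small Python model reproduces the five rows; it assumes the ranking is ordered by sum descending, which the slide does not state explicitly.

```python
def rank_desc(rows):
    """Model of RANK() OVER (ORDER BY sum DESC) for (vn, sum) rows."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    for position, (vn, total) in enumerate(ordered, start=1):
        if ranked and ranked[-1][1] == total:
            rank = ranked[-1][2]  # tie: reuse the previous row's rank
        else:
            rank = position       # new value: rank equals the row position
        ranked.append((vn, total, rank))
    return ranked


totals = [(40, 2640002), (30, 180), (50, 0), (20, 0), (10, 0)]
ranked = rank_desc(totals)

for vn, total, rank in ranked:
    print(vn, total, rank)
```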
SELECT vn, dt,
AVG(prc*qty) OVER (
PARTITION BY vn
ORDER BY dt
ROWS BETWEEN
2 PRECEDING AND
2 FOLLOWING)
FROM sale;
 vn | dt       | avg
----+----------+-----
 10 | 03012008 |  30
 30 | 05022008 |   0
 20 | 05012008 |  20
 30 | 06012008 |  60
 30 | 05022008 |   0
 30 | 06012008 |  60
 30 | 06012008 |  60
 30 | 06012008 |  60
 30 | 06012008 |  60
 30 | 06012008 |  60
 40 | 06012008 | 140
 40 | 06042008 |  90
 40 | 06052008 | 120
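The window frame ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING averages each row with up to two neighbours on either side inside its partition. The Python sketch below models that behaviour; the sample rows are invented for the illustration and are not the contents of the sale table.

```python
from collections import defaultdict


def moving_avg(values, preceding=2, following=2):
    """Average each value with its neighbours inside the frame."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - preceding)
        stop = min(len(values), i + following + 1)
        frame = values[start:stop]  # the frame shrinks at partition edges
        averages.append(sum(frame) / len(frame))
    return averages


def avg_over_partitions(rows):
    """rows are (vn, dt, amount): PARTITION BY vn, ORDER BY dt."""
    partitions = defaultdict(list)
    for vn, dt, amount in rows:
        partitions[vn].append((dt, amount))
    result = []
    for vn in sorted(partitions):
        ordered = sorted(partitions[vn])
        averages = moving_avg([amount for _, amount in ordered])
        result.extend((vn, dt, avg) for (dt, _), avg in zip(ordered, averages))
    return result


# Invented sample data: (vn, dt, prc * qty)
sample = [
    (30, "2008-06-01", 120),
    (30, "2008-05-02", 0),
    (40, "2008-06-01", 140),
    (40, "2008-06-04", 40),
]
windowed = avg_over_partitions(sample)

for row in windowed:
    print(row)
```

The frame never crosses a partition boundary, so a vendor with a single row simply gets that row's own value back.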
Module 8
LAB
Advanced Reporting
Using
OLAP
Module 9
Using Temporary Tables
Temporary Tables
Local to the session
Data may not be shared between sessions
Features
Distributed the same as any Greenplum table
May be indexed
May be Analyzed
Guidelines
Always designate DISTRIBUTED BY column(s)
Will default to the first valid column
Module 9
LAB
Using Temporary Tables
Module 10
Postgres Functions
Types of Functions
Greenplum supports several function types.
query language functions (functions written in
SQL)
procedural language functions (functions
written in for example, PL/pgSQL or PL/Tcl)
internal functions
C-language functions
This class will only present the first two types of functions:
Query Language and Procedural Language
Function Overloading
More than one function can be created with the
same name.
Must have different input parameter types or number
The optimizer selects which version to call based on
the input data type and number of arguments.
Structure of PL/pgSQL
PL/pgSQL is a block-structured language.
CREATE FUNCTION somefunc() RETURNS integer AS $$
<< outerblock >>
DECLARE
    quantity integer := 30;
BEGIN
    RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 30
    quantity := 50;
    -- Create a subblock
    DECLARE
        quantity integer := 80;
    BEGIN
        RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 80
        RAISE NOTICE 'Outer quantity here is %', outerblock.quantity;  -- Prints 50
    END;
    RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 50
    RETURN quantity;
END;
$$ LANGUAGE plpgsql;
PL/pgSQL Declarations
All variables used in a block must be declared in the declarations
section of the block.
SYNTAX:
name [ CONSTANT ] type [ NOT NULL ] [ { DEFAULT | := } expression ];
EXAMPLES:
user_id integer;
quantity numeric(5);
url varchar;
myrow tablename%ROWTYPE;
myfield tablename.columnname%TYPE;
somerow RECORD;
PL/pgSQL %Types
%TYPE provides the data type of a variable or
table column.
You can use this to declare variables that will
hold database values.
SYNTAX: variable%TYPE
EXAMPLES:
custid customer.customerid%TYPE
tid transaction.transid%TYPE
PL/pgSQL FOUND
FOUND is set by:
A SELECT INTO statement sets FOUND true if a row is assigned, false if
no row is returned.
A PERFORM statement sets FOUND true if it produces (and discards)
one or more rows, false if no row is produced.
UPDATE, INSERT, and DELETE statements set FOUND true if at least
one row is affected, false if no row is affected.
A FETCH statement sets FOUND true if it returns a row, false if no row
is returned.
A MOVE statement sets FOUND true if it successfully repositions the
cursor, false otherwise.
A FOR statement sets FOUND true if it iterates one or more times, else
false.
EXAMPLE:
CREATE OR REPLACE FUNCTION getAllStores()
RETURNS SETOF store AS
$BODY$
DECLARE r store%rowtype;
BEGIN
FOR r IN SELECT * FROM store WHERE storeid > 0 LOOP
-- can do some processing here
RETURN NEXT r;
-- return current row of SELECT
END LOOP;
RETURN;
END
$BODY$
LANGUAGE plpgsql;
PL/pgSQL Conditionals
IF THEN ELSE conditionals let you execute commands based on
certain conditions.
IF boolean-expression THEN
    statements
[ ELSIF boolean-expression THEN
    statements
[ ELSIF boolean-expression THEN
    statements
    ...]]
[ ELSE
    statements ]
END IF;
TIP: Put the most commonly occurring condition first!
CONNECTION_FAILURE
DIVISION_BY_ZERO
INTERVAL_FIELD_OVERFLOW
NULL_VALUE_VIOLATION
RAISE_EXCEPTION
PLPGSQL_ERROR
NO_DATA_FOUND
TOO_MANY_ROWS
Module 10
LAB
Postgres Functions
Module 11
Controlling Access
Database Security
Problems
gpadmin user has access to everything
Unless roles and privileges are configured, all users
have access to everything
Unless resource queues are configured, there are no
limits on what users can run
Default pg_hba.conf is loosely configured (see
Security in GPDB.pdf)
Customers are tempted to share user accounts
Database Security
Roles
Each person should have their own role
Roles should be further divided into groups, which
are usually roles that cannot log in
Privileges should be granted at the group level
whenever possible
Privileges should be as restrictive as possible
Column level access can be accomplished with views
Roles are not related to OS users or groups
Creating Roles
The Greenplum Database manages database access
permissions using the concept of roles.
The concept of roles subsumes the concepts of users
and groups.
A role can be a database user, a group, or both. Roles
can own database objects (for example, tables) and can
assign privileges on those objects to other roles to
control access to the objects.
Roles can be members of other roles, thus a member
role can inherit the attributes and privileges of its
parent role.
Database Architecture
A KEY COMPONENT OF ACCESS CONTROL
[Figure: Risk, Sales, and Mktg areas, each with views layered over tables]
Role attributes
A database role may have a number of
attributes that define what sort of tasks that
role can perform in the database.
These attributes can be set at role creation time or by
using the ALTER ROLE syntax.
Once the group role exists, you can add and remove members (user
roles) using the GRANT and REVOKE commands.
Example: GRANT admin TO john, sally;
Object privileges
Security examples
CREATE ROLE admin CREATEROLE CREATEDB;
CREATE ROLE batch;
GRANT select, insert, update, delete
ON dimensions.customer TO batch;
CREATE ROLE batchuser LOGIN;
GRANT batch TO batchuser;
Module 11
LAB
Controlling Access
Module 12
SET Operations
SET operations work against queries, not
tables
SET operations can improve query performance
Greenplum set operations include
UNION
- Ensures uniqueness in the result set
UNION ALL
- Allow for duplication, use with caution!
INTERSECT
EXCEPT (same as MINUS)
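The behaviour of the four set operations can be imitated on two small Python lists. This is only a model with invented values: SQL result sets are unordered, whereas the lists below keep first-seen order to make the output easy to read.

```python
def union(left, right):
    """UNION: combine both inputs and drop duplicate rows."""
    return list(dict.fromkeys(left + right))


def union_all(left, right):
    """UNION ALL: combine both inputs and keep every row, duplicates included."""
    return left + right


def intersect(left, right):
    """INTERSECT: distinct rows that appear in both inputs."""
    right_values = set(right)
    return [value for value in dict.fromkeys(left) if value in right_values]


def except_(left, right):
    """EXCEPT: distinct rows of the left input that are absent from the right."""
    right_values = set(right)
    return [value for value in dict.fromkeys(left) if value not in right_values]


query_a = [1, 2, 2, 3]
query_b = [2, 3, 4]

print("UNION    ", union(query_a, query_b))
print("UNION ALL", union_all(query_a, query_b))
print("INTERSECT", intersect(query_a, query_b))
print("EXCEPT   ", except_(query_a, query_b))
```

UNION ALL skips the duplicate-elimination step, which is why it is the cheaper choice when duplicates are impossible or acceptable.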
IsDate Function
Not part of Greenplum suite, this function works like the
Oracle IsDate function.
CREATE OR REPLACE FUNCTION public.isdate(text)
RETURNS boolean AS $BODY$
begin
perform $1::date;
return true;
exception when others then
return false;
end
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
Using Arrays
With Greenplum it is possible to have an ARRAY data type.
Arrays may not be the distribution key for a table
You may not create indexes on ARRAY data type columns
EXAMPLE:
CREATE TEMPORARY TABLE dimensionsOID AS(
SELECT ARRAY(SELECT c.OID
FROM pg_class c,
pg_namespace n
WHERE n.oid = c.relnamespace
AND n.nspname = 'dimensions')
AS oidArray)
DISTRIBUTED RANDOMLY;
select pg_size_pretty(pg_database_size('datamart'));
select pg_size_pretty(pg_relation_size('dimension.customer'));
select pg_relation_size('facts.transaction');
select pg_database_size('datamart');
select pg_relation_size(schemaname||'.'||tablename) from pg_tables;
f. gpsizecalc -x dbname -s g
g. gpsizecalc -a dbname -t public.tablename -s g
Skew Analysis
Finding uneven data distribution & fixing it!
Command: gpskew
The gpskew script can be used to determine if table data is equally distributed across
all of the active segments. gpskew reports the following information:
The total number of records in the specified table. Note that for parent tables
with inherited child tables, you must use the -h option to show child table skew as well.
The number of records on each segment.
The variance of records between segments. It takes the segment which has the
maximum count, and the segment which has the minimum count, and reports the
difference between these two segments.
The segment response times (if -r is supplied).
The distribution key column names (if -c is supplied) .
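The variance figure described above is simply the largest per-segment count minus the smallest. The Python sketch below recomputes it; the maximum, minimum, and total match the gpskew run on public.t1 shown in this module, while the segment names and the six middle counts are assumptions filled in so that the total adds up.

```python
# seg0 and seg1 carry the reported maximum and minimum; the other counts are
# assumed values chosen so the total equals the reported 2560000 records.
segment_counts = {
    "seg0": 320512,
    "seg1": 319488,
    "seg2": 320000,
    "seg3": 320000,
    "seg4": 320000,
    "seg5": 320000,
    "seg6": 320000,
    "seg7": 320000,
}


def skew_summary(counts):
    """Summarise per-segment row counts the way the gpskew report does."""
    largest = max(counts, key=counts.get)
    smallest = min(counts, key=counts.get)
    return {
        "total": sum(counts.values()),
        "max_segment": largest,
        "max": counts[largest],
        "min_segment": smallest,
        "min": counts[smallest],
        "variance": counts[largest] - counts[smallest],
    }


summary = skew_summary(segment_counts)
for label, value in summary.items():
    print(f"{label:12} = {value}")
```

A variance that is small relative to the average count per segment, as here, indicates an even distribution.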
Skew analysis
aparashar@mdw1 ~ $ > gpskew -t public.t1
20080823:09:43:43:gpskew:mdw1:aparashar-[INFO]:-Spawning parallel processes
........
20080823:09:43:45:gpskew:mdw1:aparashar-[INFO]:-Waiting for parallel processes batch [1], please wait...
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:--------------------------------------
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:-Parallel process exit status
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:--------------------------------------
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:--------------------------------------
20080823:09:43:47:gpskew:mdw1:aparashar-[INFO]:-Number of Distribution Column(s) = {1}
= c1
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:--------------------------------------
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Total records = 2560000
= Primary Segments
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:--------------------------------------
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Skew result
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:-Maximum Record Count = 320512
= 3114
= gp
= 319488
= 3114
= gp
= 1024
20080823:09:43:48:gpskew:mdw1:aparashar-[INFO]:--------------------------------------
Skew analysis
Skew problem can be seen through queries
Use Explain analyze
ais_3114=# explain analyze SELECT count(*) from (SELECT count(*) from t2 group by c2) as a;
->
Subquery Scan a
HashAggregate
Hash chain length 1.0 avg, 1 max, using 1251 of 110160 buckets.
(slice1)
HashAggregate
Skew Analysis
Module 12
LAB
Advanced SQL Topics