From: mason_s <ma...@us...> - 2010-08-31 17:22:41

Project "Postgres-XC". The branch, master has been updated
       via  06c882f78694a31749746aad0cb76347a3f7bcef (commit)
      from  58d1f0d4fe5de5db3655f52df9031abc1ce5b84e (commit)

- Log -----------------------------------------------------------------
commit 06c882f78694a31749746aad0cb76347a3f7bcef
Author: Mason Sharp <ma...@us...>
Date:   Tue Aug 31 13:21:36 2010 -0400

    Fix a bug with AVG()

    We tried to avoid coordinator aggregate handling when only a single
    node is involved, but that causes a problem for some aggregates.

diff --git a/src/backend/pgxc/plan/planner.c b/src/backend/pgxc/plan/planner.c
index c8911b7..e18e813 100644
--- a/src/backend/pgxc/plan/planner.c
+++ b/src/backend/pgxc/plan/planner.c
@@ -2158,10 +2158,14 @@ pgxc_planner(Query *query, int cursorOptions, ParamListInfo boundParams)
 	if (query_step->exec_nodes)
 		query_step->combine_type = get_plan_combine_type(
 				query, query_step->exec_nodes->baselocatortype);
-	/* Only set up if running on more than one node */
-	if (query_step->exec_nodes && query_step->exec_nodes->nodelist &&
-			list_length(query_step->exec_nodes->nodelist) > 1)
-		query_step->simple_aggregates = get_simple_aggregates(query);
+
+	/* Set up simple aggregates */
+	/* PGXCTODO - we should detect what types of aggregates are used.
+	 * in some cases we can avoid the final step and merely proxy results
+	 * (when there is only one data node involved) instead of using
+	 * coordinator consolidation. At the moment this is needed for AVG()
+	 */
+	query_step->simple_aggregates = get_simple_aggregates(query);

 	/*
 	 * Add sorting to the step
-----------------------------------------------------------------------

Summary of changes:
 src/backend/pgxc/plan/planner.c |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

hooks/post-receive
--
Postgres-XC
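
The underlying issue: for AVG(), each data node returns a transition state (a running sum and a row count) rather than a finished value, so the final sum/count division must happen on the Coordinator even when only one node answered. A minimal sketch of that combine-and-finalize step in C; the AvgState layout and combine_avg() helper are illustrative assumptions, not the actual Postgres-XC structures:

    #include <stdio.h>

    /* Hypothetical per-node AVG() transition state: running sum and row count. */
    typedef struct AvgState
    {
        double  sum;
        long    count;
    } AvgState;

    /* Combine the per-node states and apply the final division on the
     * coordinator; the finalization step is required even when nnodes == 1. */
    static double
    combine_avg(const AvgState *states, int nnodes)
    {
        AvgState    total = {0.0, 0};

        for (int i = 0; i < nnodes; i++)
        {
            total.sum += states[i].sum;
            total.count += states[i].count;
        }
        return (total.count > 0) ? total.sum / total.count : 0.0;
    }

    int
    main(void)
    {
        AvgState    per_node[] = {{10.0, 4}, {26.0, 6}};

        printf("avg = %g\n", combine_avg(per_node, 2));    /* 36 / 10 = 3.6 */
        return 0;
    }

The commit above takes the simple route of always running this final step; its PGXCTODO notes that aggregates which need no finalization could instead be proxied straight through from a single data node.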
From: mason_s <ma...@us...> - 2010-08-31 17:11:13

Project "Postgres-XC". The branch, master has been updated
       via  58d1f0d4fe5de5db3655f52df9031abc1ce5b84e (commit)
      from  9894afcd6d20b47c303c49b8ed5141d2b7902237 (commit)

- Log -----------------------------------------------------------------
commit 58d1f0d4fe5de5db3655f52df9031abc1ce5b84e
Author: Mason Sharp <ma...@us...>
Date:   Tue Aug 31 13:09:34 2010 -0400

    Fixed a bug in GTM introduced with timestamp piggybacking with GXID.
    Without this, one could not use GTM directly, only through the proxy.

    Discovered and written by Andrei Martsinchyk

diff --git a/src/gtm/main/gtm_txn.c b/src/gtm/main/gtm_txn.c
index dec0a63..2205167 100644
--- a/src/gtm/main/gtm_txn.c
+++ b/src/gtm/main/gtm_txn.c
@@ -894,6 +894,7 @@ ProcessBeginTransactionGetGXIDCommand(Port *myport, StringInfo message)
 	StringInfoData buf;
 	GTM_TransactionHandle txn;
 	GlobalTransactionId gxid;
+	GTM_Timestamp timestamp;
 	MemoryContext oldContext;

 	txn_isolation_level = pq_getmsgint(message, sizeof (GTM_IsolationLevel));
@@ -901,6 +902,9 @@ ProcessBeginTransactionGetGXIDCommand(Port *myport, StringInfo message)

 	oldContext = MemoryContextSwitchTo(TopMemoryContext);

+	/* GXID has been received, now it's time to get a GTM timestamp */
+	timestamp = GTM_TimestampGetCurrent();
+
 	/*
 	 * Start a new transaction
 	 *
@@ -931,6 +935,7 @@ ProcessBeginTransactionGetGXIDCommand(Port *myport, StringInfo message)
 		pq_sendbytes(&buf, (char *)&proxyhdr, sizeof (GTM_ProxyMsgHeader));
 	}
 	pq_sendbytes(&buf, (char *)&gxid, sizeof(gxid));
+	pq_sendbytes(&buf, (char *)&timestamp, sizeof (GTM_Timestamp));
 	pq_endmessage(myport, &buf);

 	if (!myport->is_proxy)
-----------------------------------------------------------------------

Summary of changes:
 src/gtm/main/gtm_txn.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

hooks/post-receive
--
Postgres-XC
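
In protocol terms, this fix means a TXN_BEGIN_GETGXID reply now carries the GXID immediately followed by an 8-byte GTM_Timestamp, and a client connected to GTM directly must consume both fields or it falls out of step with the byte stream (the proxy path already happened to). A rough sketch of that field layout, assuming a flat reply buffer; decode_begin_txn_result() is hypothetical, the real client reads through gtmpqGetnchar():

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    typedef int32_t GlobalTransactionId;
    typedef int64_t GTM_Timestamp;  /* 64-bit, matching GTM_Timestamp in gtm_c.h */

    /*
     * Decode a TXN_BEGIN_GETGXID_RESULT payload from a raw buffer: the GXID
     * comes first, immediately followed by the piggybacked GTM timestamp.
     * Byte-order handling is omitted; this only illustrates the layout.
     */
    static int
    decode_begin_txn_result(const char *buf, size_t len,
                            GlobalTransactionId *gxid, GTM_Timestamp *ts)
    {
        if (len < sizeof(*gxid) + sizeof(*ts))
            return -1;          /* short read: the desync this commit fixes */
        memcpy(gxid, buf, sizeof(*gxid));
        memcpy(ts, buf + sizeof(*gxid), sizeof(*ts));
        return 0;
    }

    int
    main(void)
    {
        char                buf[12] = {0};
        GlobalTransactionId gxid;
        GTM_Timestamp       ts;

        if (decode_begin_txn_result(buf, sizeof(buf), &gxid, &ts) == 0)
            printf("gxid=%d ts=%lld\n", (int) gxid, (long long) ts);
        return 0;
    }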
From: Michael P. <mic...@us...> - 2010-08-23 08:15:54

Project "Postgres-XC". The annotated tag, v0.9.2 has been created
        at  7402b46760f3fd0d140fd177edfecaae31ec058b (tag)
   tagging  d7ca431066efe320107581186ab853b28fa5f7a7 (commit)
  replaces  v0.9.1
 tagged by  Michael P
        on  Mon Aug 23 17:16:55 2010 +0900

- Log -----------------------------------------------------------------
Postgres-XC version 0.9.2 tag

Andrei Martsinchyk (4):
      Reverted PANIC ereports back to ERROR
      Use ereport instead of Assert if sort operation is not defined
      If expressions should be added to ORDER BY clause of the step query
      Fixed a bug when searching terminating semicolon.

Mason S (15):
      Fixed a bug when using a table after it had been created in the same
      Minor change that updates COPY so that it knows ahead
      Add support for immutable stored functions and enable support
      Support for pg_dump and pg_restore.
      Add support for views.
      When using hash distributed tables and a value that corresponds to
      Do not allow WITH RECURSIVE or windowing functions until
      Do not yet allow creation of temp tables until we properly handle them.
      Handle more types of queries to determine whether or not they
      Allow rules to be created, provided that they do not use NOTIFY,
      Fixed assertion
      Add support for ORDER BY and DISTINCT.
      Changed some error messages so that they will not be duplicates
      In Postgres-XC, the error stack may overflow because
      Fix a crash that may occur within the pooler when a

Michael P (3):
      Remove an unnecessary file for the repository.
      Support for RENAME/DROP SCHEMA with sequences
      Support for cold synchronization of catalog table of coordinator.

Pavan Deolasee (3):
      Add support for ALTER SEQUENCE. Michael Paquier with some editorialization from Pavan Deolasee
      Add a missing include file from the previous commit
      Handling ALTER SEQUENCE at the GTM proxy as well. Michael Paquier.
-----------------------------------------------------------------------

hooks/post-receive
--
Postgres-XC
From: Michael P. <mic...@us...> - 2010-08-23 08:15:17

Project "Postgres-XC". The tag, v0.9.2 has been deleted
       was  d7ca431066efe320107581186ab853b28fa5f7a7

-----------------------------------------------------------------------
d7ca431066efe320107581186ab853b28fa5f7a7 Support for cold synchronization of catalog table of coordinator.
-----------------------------------------------------------------------

hooks/post-receive
--
Postgres-XC
From: Michael P. <mic...@us...> - 2010-08-23 08:08:59

Project "Postgres-XC". The branch, master has been updated
       via  9894afcd6d20b47c303c49b8ed5141d2b7902237 (commit)
      from  ba6f32f142cf8731ba29e5495e0f97f3b0455da0 (commit)

- Log -----------------------------------------------------------------
commit 9894afcd6d20b47c303c49b8ed5141d2b7902237
Author: Michael P <mic...@us...>
Date:   Mon Aug 23 17:00:01 2010 +0900

    Support for Global timestamp in Postgres-XC.

    When a transaction is begun on a Coordinator, a transaction sending a
    BEGIN message to GTM receives back a timestamp with the usual GXID.
    This timestamp is calculated from the clock of the GTM server. With
    that, nodes in the cluster can adjust their own timeline with GTM by
    calculating a delta value based on the GTM timestamp and their local
    clock.

    Like GXID and snapshot, a timestamp is also sent down to Datanodes so
    as to keep timestamp values consistent between Coordinator and
    Datanodes.

    This commit supports global timestamp values for now(),
    statement_timestamp, transaction_timestamp, current_date,
    current_time, current_timestamp, localtime and localtimestamp.
    clock_timestamp and timeofday make their calculation based on the
    local server clock, so they get their results from the local node
    where they are run. Their use could lead to inconsistencies if used
    in a transaction involving several Datanodes.

diff --git a/src/backend/access/transam/gtm.c b/src/backend/access/transam/gtm.c
index f9499c9..c7f3547 100644
--- a/src/backend/access/transam/gtm.c
+++ b/src/backend/access/transam/gtm.c
@@ -67,14 +67,14 @@ CloseGTM(void)
 }

 GlobalTransactionId
-BeginTranGTM(void)
+BeginTranGTM(GTM_Timestamp *timestamp)
 {
 	GlobalTransactionId xid = InvalidGlobalTransactionId;

 	CheckConnection();
 	// TODO Isolation level
 	if (conn)
-		xid = begin_transaction(conn, GTM_ISOLATION_RC);
+		xid = begin_transaction(conn, GTM_ISOLATION_RC, timestamp);

 	/* If something went wrong (timeout), try and reset GTM connection
 	 * and retry. This is safe at the beginning of a transaction.
@@ -84,7 +84,7 @@ BeginTranGTM(void)
 		CloseGTM();
 		InitGTM();
 		if (conn)
-			xid = begin_transaction(conn, GTM_ISOLATION_RC);
+			xid = begin_transaction(conn, GTM_ISOLATION_RC, timestamp);
 	}
 	return xid;
 }
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index f2a9d74..5176e85 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -75,11 +75,17 @@ GetForceXidFromGTM(void)
  * The new XID is also stored into MyProc before returning.
  */
 TransactionId
+#ifdef PGXC
+GetNewTransactionId(bool isSubXact, bool *timestamp_received, GTM_Timestamp *timestamp)
+#else
 GetNewTransactionId(bool isSubXact)
+#endif
 {
 	TransactionId xid;
-#ifdef PGXC
+#ifdef PGXC
 	bool increment_xid = true;
+
+	*timestamp_received = false;
 #endif

 	/*
@@ -102,8 +108,10 @@ GetNewTransactionId(bool isSubXact)
 		 * This will help with GTM connection issues- we will not
 		 * block all other processes.
 		 */
-		xid = (TransactionId) BeginTranGTM();
+		xid = (TransactionId) BeginTranGTM(timestamp);
+		*timestamp_received = true;
 	}
+
 #endif

 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
@@ -144,18 +152,20 @@ GetNewTransactionId(bool isSubXact)
 			 * exclude it from other snapshots.
 			 */
 			next_xid = (TransactionId) BeginTranAutovacuumGTM();
-		} else {
+		}
+		else
+		{
 			elog (DEBUG1, "Getting XID for autovacuum worker (analyze)");
 			/* try and get gxid directly from GTM */
-			next_xid = (TransactionId) BeginTranGTM();
+			next_xid = (TransactionId) BeginTranGTM(NULL);
 		}
 	}
 	else if (GetForceXidFromGTM())
 	{
 		elog (DEBUG1, "Force get XID from GTM");
 		/* try and get gxid directly from GTM */
-		next_xid = (TransactionId) BeginTranGTM();
+		next_xid = (TransactionId) BeginTranGTM(NULL);
 	}
-
+
 	if (TransactionIdIsValid(next_xid))
 	{
 		xid = next_xid;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 673aad1..8a946cc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -208,6 +208,19 @@ static TimestampTz stmtStartTimestamp;
 static TimestampTz xactStopTimestamp;

 /*
+ * PGXC receives from GTM a timestamp value at the same time as a GXID.
+ * This one is set as GTMxactStartTimestamp and is a return value of now(), current_transaction().
+ * GTMxactStartTimestamp is also sent to each node with gxid and snapshot and delta is calculated locally.
+ * GTMdeltaTimestamp is used to calculate current_statement as its value can change
+ * during a transaction. Delta can have a different value through the nodes of the cluster
+ * but its uniqueness in the cluster is maintained thanks to the global value GTMxactStartTimestamp.
+ */
+#ifdef PGXC
+static TimestampTz GTMxactStartTimestamp = 0;
+static TimestampTz GTMdeltaTimestamp = 0;
+#endif
+
+/*
  * GID to be used for preparing the current transaction. This is also
  * global to a whole transaction, so we don't keep it in the state stack.
  */
@@ -315,12 +328,28 @@ GetCurrentGlobalTransactionId(void)
  *
  * This will return the GXID of the specified transaction,
  * getting one from the GTM if it's not yet set.
+ * It also returns a timestamp value if a GXID has been taken from GTM.
  */
 static GlobalTransactionId
 GetGlobalTransactionId(TransactionState s)
 {
+	GTM_Timestamp gtm_timestamp;
+	bool received_tp;
+
+	/*
+	 * Here we receive timestamp at the same time as gxid.
+	 */
 	if (!GlobalTransactionIdIsValid(s->globalTransactionId))
-		s->globalTransactionId = (GlobalTransactionId) GetNewTransactionId(s->parent != NULL);
+		s->globalTransactionId = (GlobalTransactionId) GetNewTransactionId(s->parent != NULL,
+																		   &received_tp,
+																		   &gtm_timestamp);
+
+	/* Set a timestamp value if and only if it has been received from GTM */
+	if (received_tp)
+	{
+		GTMxactStartTimestamp = (TimestampTz) gtm_timestamp;
+		GTMdeltaTimestamp = GTMxactStartTimestamp - stmtStartTimestamp;
+	}

 	return s->globalTransactionId;
 }
@@ -473,8 +502,20 @@ AssignTransactionId(TransactionState s)
 			 s->transactionId, isSubXact ? "true" : "false");
 	}
 	else
-#endif
+	{
+		GTM_Timestamp gtm_timestamp;
+		bool received_tp;
+
+		s->transactionId = GetNewTransactionId(isSubXact, &received_tp, &gtm_timestamp);
+		if (received_tp)
+		{
+			GTMxactStartTimestamp = (TimestampTz) gtm_timestamp;
+			GTMdeltaTimestamp = GTMxactStartTimestamp - stmtStartTimestamp;
+		}
+	}
+#else
 	s->transactionId = GetNewTransactionId(isSubXact);
+#endif

 	if (isSubXact)
 		SubTransSetParent(s->transactionId, s->parent->transactionId);
@@ -536,7 +577,15 @@ GetCurrentCommandId(bool used)
 TimestampTz
 GetCurrentTransactionStartTimestamp(void)
 {
+	/*
+	 * In Postgres-XC, Transaction start timestamp is the value received
+	 * from GTM along with GXID.
+	 */
+#ifdef PGXC
+	return GTMxactStartTimestamp;
+#else
 	return xactStartTimestamp;
+#endif
 }

 /*
@@ -545,7 +594,17 @@ GetCurrentTransactionStartTimestamp(void)
 TimestampTz
 GetCurrentStatementStartTimestamp(void)
 {
+	/*
+	 * For Postgres-XC, Statement start timestamp is adjusted at each node
+	 * (Coordinator and Datanode) with a difference value that is calculated
+	 * based on the global timestamp value received from GTM and the local
+	 * clock. This permits to follow the GTM timeline in the cluster.
+	 */
+#ifdef PGXC
+	return stmtStartTimestamp + GTMdeltaTimestamp;
+#else
 	return stmtStartTimestamp;
+#endif
 }

 /*
@@ -557,11 +616,36 @@ GetCurrentStatementStartTimestamp(void)
 TimestampTz
 GetCurrentTransactionStopTimestamp(void)
 {
+	/*
+	 * As for Statement start timestamp, stop timestamp has to
+	 * be adjusted with the delta value calculated with the
+	 * timestamp received from GTM and the local node clock.
+	 */
+#ifdef PGXC
+	TimestampTz timestamp;
+
+	if (xactStopTimestamp != 0)
+		return xactStopTimestamp + GTMdeltaTimestamp;
+
+	timestamp = GetCurrentTimestamp() + GTMdeltaTimestamp;
+
+	return timestamp;
+#else
 	if (xactStopTimestamp != 0)
 		return xactStopTimestamp;
+
 	return GetCurrentTimestamp();
+#endif
 }

+#ifdef PGXC
+TimestampTz
+GetCurrentGTMStartTimestamp(void)
+{
+	return GTMxactStartTimestamp;
+}
+#endif
+
 /*
  * SetCurrentStatementStartTimestamp
  */
@@ -580,6 +664,20 @@ SetCurrentTransactionStopTimestamp(void)
 	xactStopTimestamp = GetCurrentTimestamp();
 }

+#ifdef PGXC
+/*
+ * SetCurrentGTMDeltaTimestamp
+ *
+ * Note: Sets local timestamp delta with the value received from GTM
+ */
+void
+SetCurrentGTMDeltaTimestamp(TimestampTz timestamp)
+{
+	GTMxactStartTimestamp = timestamp;
+	GTMdeltaTimestamp = GTMxactStartTimestamp - xactStartTimestamp;
+}
+#endif
+
 /*
  * GetCurrentTransactionNestLevel
  *
@@ -950,7 +1048,12 @@ RecordTransactionCommit(void)
 		MyProc->inCommit = true;

 		SetCurrentTransactionStopTimestamp();
+#ifdef PGXC
+		/* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
+		xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
+#else
 		xlrec.xact_time = xactStopTimestamp;
+#endif
 		xlrec.nrels = nrels;
 		xlrec.nsubxacts = nchildren;
 		rdata[0].data = (char *) (&xlrec);
@@ -1275,7 +1378,12 @@ RecordTransactionAbort(bool isSubXact)
 	else
 	{
 		SetCurrentTransactionStopTimestamp();
+#ifdef PGXC
+		/* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
+		xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
+#else
 		xlrec.xact_time = xactStopTimestamp;
+#endif
 	}
 	xlrec.nrels = nrels;
 	xlrec.nsubxacts = nchildren;
@@ -1576,7 +1684,12 @@ StartTransaction(void)
 	 */
 	xactStartTimestamp = stmtStartTimestamp;
 	xactStopTimestamp = 0;
+#ifdef PGXC
+	/* For Postgres-XC, transaction start timestamp has to follow the GTM timeline */
+	pgstat_report_xact_timestamp(GTMxactStartTimestamp);
+#else
 	pgstat_report_xact_timestamp(xactStartTimestamp);
+#endif

 	/*
 	 * initialize current transaction state fields
diff --git a/src/backend/pgxc/pool/datanode.c b/src/backend/pgxc/pool/datanode.c
index 0f4072d..ba56ca1 100644
--- a/src/backend/pgxc/pool/datanode.c
+++ b/src/backend/pgxc/pool/datanode.c
@@ -893,6 +893,48 @@ data_node_send_snapshot(DataNodeHandle *handle, Snapshot snapshot)
 }

 /*
+ * Send the timestamp down to the Datanode
+ */
+int
+data_node_send_timestamp(DataNodeHandle *handle, TimestampTz timestamp)
+{
+	int		msglen = 12;	/* 4 bytes for msglen and 8 bytes for timestamp (int64) */
+	uint32	n32;
+	int64	i = (int64) timestamp;
+
+	/* msgType + msgLen */
+	if (ensure_out_buffer_capacity(handle->outEnd + 1 + msglen, handle) != 0)
+	{
+		add_error_message(handle, "out of memory");
+		return EOF;
+	}
+	handle->outBuffer[handle->outEnd++] = 't';
+	msglen = htonl(msglen);
+	memcpy(handle->outBuffer + handle->outEnd, &msglen, 4);
+	handle->outEnd += 4;
+
+	/* High order half first */
+#ifdef INT64_IS_BUSTED
+	/* don't try a right shift of 32 on a 32-bit word */
+	n32 = (i < 0) ? -1 : 0;
+#else
+	n32 = (uint32) (i >> 32);
+#endif
+	n32 = htonl(n32);
+	memcpy(handle->outBuffer + handle->outEnd, &n32, 4);
+	handle->outEnd += 4;
+
+	/* Now the low order half */
+	n32 = (uint32) i;
+	n32 = htonl(n32);
+	memcpy(handle->outBuffer + handle->outEnd, &n32, 4);
+	handle->outEnd += 4;
+
+	return 0;
+}
+
+
+/*
  * Add another message to the list of errors to be returned back to the client
  * at the convenient time
  */
diff --git a/src/backend/pgxc/pool/execRemote.c b/src/backend/pgxc/pool/execRemote.c
index 43569e0..f065289 100644
--- a/src/backend/pgxc/pool/execRemote.c
+++ b/src/backend/pgxc/pool/execRemote.c
@@ -1074,6 +1074,7 @@ data_node_begin(int conn_count, DataNodeHandle ** connections,
 	int			i;
 	struct timeval *timeout = NULL;
 	RemoteQueryState *combiner;
+	TimestampTz timestamp = GetCurrentGTMStartTimestamp();

 	/* Send BEGIN */
 	for (i = 0; i < conn_count; i++)
@@ -1081,6 +1082,9 @@ data_node_begin(int conn_count, DataNodeHandle ** connections,
 		if (GlobalTransactionIdIsValid(gxid) && data_node_send_gxid(connections[i], gxid))
 			return EOF;

+		if (GlobalTimestampIsValid(timestamp) && data_node_send_timestamp(connections[i], timestamp))
+			return EOF;
+
 		if (data_node_send_query(connections[i], "BEGIN"))
 			return EOF;
 	}
@@ -1222,8 +1226,13 @@ data_node_commit(int conn_count, DataNodeHandle ** connections)
 	else
 		sprintf(buffer, "COMMIT PREPARED 'T%d'", gxid);

-	/* We need to use a new xid, the data nodes have reset */
-	two_phase_xid = BeginTranGTM();
+	/*
+	 * We need to use a new xid, the data nodes have reset.
+	 * Timestamp has already been set with BEGIN on remote Datanodes,
+	 * so don't use it here.
+	 */
+	two_phase_xid = BeginTranGTM(NULL);
+
 	for (i = 0; i < conn_count; i++)
 	{
 		if (data_node_send_gxid(connections[i], two_phase_xid))
@@ -1338,6 +1347,7 @@ DataNodeCopyBegin(const char *query, List *nodelist, Snapshot snapshot, bool is_
 	bool		need_tran;
 	GlobalTransactionId gxid;
 	RemoteQueryState *combiner;
+	TimestampTz timestamp = GetCurrentGTMStartTimestamp();

 	if (conn_count == 0)
 		return NULL;
@@ -1432,6 +1442,19 @@ DataNodeCopyBegin(const char *query, List *nodelist, Snapshot snapshot, bool is_
 			pfree(copy_connections);
 			return NULL;
 		}
+		if (conn_count == 1 && data_node_send_timestamp(connections[i], timestamp))
+		{
+			/*
+			 * If a transaction involves multiple connections, timestamp is
+			 * always sent down to Datanodes with data_node_begin.
+			 * An autocommit transaction needs the global timestamp also,
+			 * so handle this case here.
+			 */
+			add_error_message(connections[i], "Can not send request");
+			pfree(connections);
+			pfree(copy_connections);
+			return NULL;
+		}
 		if (snapshot && data_node_send_snapshot(connections[i], snapshot))
 		{
 			add_error_message(connections[i], "Can not send request");
@@ -2027,7 +2050,8 @@ ExecRemoteQuery(RemoteQueryState *node)
 		bool		force_autocommit = step->force_autocommit;
 		bool		is_read_only = step->read_only;
 		GlobalTransactionId gxid = InvalidGlobalTransactionId;
-		Snapshot	snapshot = GetActiveSnapshot();
+		Snapshot	snapshot = GetActiveSnapshot();
+		TimestampTz timestamp = GetCurrentGTMStartTimestamp();
 		DataNodeHandle **connections = NULL;
 		DataNodeHandle **primaryconnection = NULL;
 		int			i;
@@ -2133,6 +2157,20 @@ ExecRemoteQuery(RemoteQueryState *node)
 					(errcode(ERRCODE_INTERNAL_ERROR),
 					 errmsg("Failed to send command to data nodes")));
 		}
+		if (total_conn_count == 1 && data_node_send_timestamp(primaryconnection[0], timestamp))
+		{
+			/*
+			 * If a transaction involves multiple connections, timestamp is
+			 * always sent down to Datanodes with data_node_begin.
+			 * An autocommit transaction needs the global timestamp also,
+			 * so handle this case here.
+			 */
+			pfree(connections);
+			pfree(primaryconnection);
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg("Failed to send command to data nodes")));
+		}
 		if (snapshot && data_node_send_snapshot(primaryconnection[0], snapshot))
 		{
 			pfree(connections);
@@ -2184,6 +2222,20 @@ ExecRemoteQuery(RemoteQueryState *node)
 					(errcode(ERRCODE_INTERNAL_ERROR),
 					 errmsg("Failed to send command to data nodes")));
 		}
+		if (total_conn_count == 1 && data_node_send_timestamp(connections[i], timestamp))
+		{
+			/*
+			 * If a transaction involves multiple connections, timestamp is
+			 * always sent down to Datanodes with data_node_begin.
+			 * An autocommit transaction needs the global timestamp also,
+			 * so handle this case here.
+			 */
+			pfree(connections);
+			pfree(primaryconnection);
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg("Failed to send command to data nodes")));
+		}
 		if (snapshot && data_node_send_snapshot(connections[i], snapshot))
 		{
 			pfree(connections);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a6f4767..84c70c6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -429,8 +429,9 @@ SocketBackend(StringInfo inBuf)
 						 errmsg("invalid frontend message type %d", qtype)));
 			break;
 #ifdef PGXC /* PGXC_DATANODE */
-		case 'g':
-		case 's':
+		case 'g':	/* GXID */
+		case 's':	/* Snapshot */
+		case 't':	/* Timestamp */
 			break;
 #endif

@@ -2951,6 +2952,8 @@ PostgresMain(int argc, char *argv[], const char *username)
 	int			xmax;
 	int			xcnt;
 	int		   *xip;
+	/* Timestamp info */
+	TimestampTz timestamp;
 #endif

 #define PendingConfigOption(name,val) \
@@ -4015,6 +4018,17 @@ PostgresMain(int argc, char *argv[], const char *username)
 				pq_getmsgend(&input_message);
 				SetGlobalSnapshotData(xmin, xmax, xcnt, xip);
 				break;
+
+			case 't':			/* timestamp */
+				timestamp = (TimestampTz) pq_getmsgint64(&input_message);
+				pq_getmsgend(&input_message);
+
+				/*
+				 * Set in xact.x the static Timestamp difference value with GTM
+				 * and the timestamp received values for Datanode reference
+				 */
+				SetCurrentGTMDeltaTimestamp(timestamp);
+				break;
 #endif /* PGXC */

 			default:
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 375a830..4634278 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -32,6 +32,10 @@
 #include "utils/builtins.h"
 #include "utils/datetime.h"

+#ifdef PGXC
+#include "pgxc/pgxc.h"
+#endif
+
 /*
  * gcc's -ffast-math switch breaks routines that expect exact results from
  * expressions like timeval / SECS_PER_HOUR, where timeval is double.
diff --git a/src/gtm/client/fe-protocol.c b/src/gtm/client/fe-protocol.c
index 051bb1d..0847b0d 100644
--- a/src/gtm/client/fe-protocol.c
+++ b/src/gtm/client/fe-protocol.c
@@ -350,12 +350,22 @@ gtmpqParseSuccess(GTM_Conn *conn, GTM_Result *result)
 			break;

 		case TXN_BEGIN_GETGXID_RESULT:
+			if (gtmpqGetnchar((char *)&result->gr_resdata.grd_gxid_tp.gxid,
+							  sizeof (GlobalTransactionId), conn))
+			{
+				result->gr_status = -1;
+				break;
+			}
+			if (gtmpqGetnchar((char *)&result->gr_resdata.grd_gxid_tp.timestamp,
+							  sizeof (GTM_Timestamp), conn))
+				result->gr_status = -1;
+			break;
 		case TXN_BEGIN_GETGXID_AUTOVACUUM_RESULT:
 		case TXN_PREPARE_RESULT:
 			if (gtmpqGetnchar((char *)&result->gr_resdata.grd_gxid,
 							  sizeof (GlobalTransactionId), conn))
 				result->gr_status = -1;
-			break;
+			break;

 		case TXN_COMMIT_RESULT:
 		case TXN_ROLLBACK_RESULT:
@@ -393,9 +403,11 @@ gtmpqParseSuccess(GTM_Conn *conn, GTM_Result *result)
 				result->gr_status = -1;
 				break;
 			}
+			if (gtmpqGetnchar((char *)&result->gr_resdata.grd_txn_get_multi.timestamp,
+							  sizeof (GTM_Timestamp), conn))
+				result->gr_status = -1;
 			break;

-
 		case TXN_COMMIT_MULTI_RESULT:
 		case TXN_ROLLBACK_MULTI_RESULT:
 			if (gtmpqGetnchar((char *)&result->gr_resdata.grd_txn_rc_multi.txn_count,
diff --git a/src/gtm/client/gtm_client.c b/src/gtm/client/gtm_client.c
index 9df28c7..35f81ae 100644
--- a/src/gtm/client/gtm_client.c
+++ b/src/gtm/client/gtm_client.c
@@ -48,7 +48,7 @@ disconnect_gtm(GTM_Conn *conn)
  * Transaction Management API
  */
 GlobalTransactionId
-begin_transaction(GTM_Conn *conn, GTM_IsolationLevel isolevel)
+begin_transaction(GTM_Conn *conn, GTM_IsolationLevel isolevel, GTM_Timestamp *timestamp)
 {
 	bool txn_read_only = false;
 	GTM_Result *res = NULL;
@@ -78,7 +78,12 @@ begin_transaction(GTM_Conn *conn, GTM_IsolationLevel isolevel)
 		goto receive_failed;

 	if (res->gr_status == 0)
-		return res->gr_resdata.grd_gxid;
+	{
+		if (timestamp)
+			*timestamp = res->gr_resdata.grd_gxid_tp.timestamp;
+
+		return res->gr_resdata.grd_gxid_tp.gxid;
+	}
 	else
 		return InvalidGlobalTransactionId;
diff --git a/src/gtm/main/Makefile b/src/gtm/main/Makefile
index 7fcdf82..5d8aaea 100644
--- a/src/gtm/main/Makefile
+++ b/src/gtm/main/Makefile
@@ -3,7 +3,7 @@ top_build_dir=../..
 include $(top_build_dir)/gtm/Makefile.global

-OBJS=main.o gtm_thread.o gtm_txn.o gtm_seq.o gtm_snap.o ../common/libgtm.a ../libpq/libpqcomm.a ../path/libgtmpath.a
+OBJS=main.o gtm_thread.o gtm_txn.o gtm_seq.o gtm_snap.o gtm_time.o ../common/libgtm.a ../libpq/libpqcomm.a ../path/libgtmpath.a

 LDFLAGS=-L$(top_build_dir)/common -L$(top_build_dir)/libpq
 LIBS=-lpthread
diff --git a/src/gtm/main/gtm_time.c b/src/gtm/main/gtm_time.c
new file mode 100644
index 0000000..ea795af
--- /dev/null
+++ b/src/gtm/main/gtm_time.c
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * gtm_time.c
+ *		Timestamp handling on GTM
+ *
+ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * Portions Copyright (c) 2010 Nippon Telegraph and Telephone Corporation
+ *
+ *
+ * IDENTIFICATION
+ *		$PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "gtm/gtm.h"
+#include "gtm/gtm_c.h"
+#include "gtm/gtm_time.h"
+#include <time.h>
+#include <sys/time.h>
+
+GTM_Timestamp
+GTM_TimestampGetCurrent(void)
+{
+	struct timeval	tp;
+	GTM_Timestamp	result;
+
+	gettimeofday(&tp, NULL);
+
+	result = (GTM_Timestamp) tp.tv_sec -
+		((GTM_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
+
+#ifdef HAVE_INT64_TIMESTAMP
+	result = (result * USECS_PER_SEC) + tp.tv_usec;
+#else
+	result = result + (tp.tv_usec / 1000000.0);
+#endif
+
+	return result;
+}
diff --git a/src/gtm/main/gtm_txn.c b/src/gtm/main/gtm_txn.c
index 6090ae1..dec0a63 100644
--- a/src/gtm/main/gtm_txn.c
+++ b/src/gtm/main/gtm_txn.c
@@ -18,6 +18,8 @@
 #include "gtm/palloc.h"
 #include "gtm/gtm.h"
 #include "gtm/gtm_txn.h"
+#include "gtm/gtm_c.h"
+#include "gtm/gtm_time.h"
 #include "gtm/assert.h"
 #include "gtm/stringinfo.h"
 #include "gtm/libpq.h"
@@ -840,6 +842,7 @@ ProcessBeginTransactionCommand(Port *myport, StringInfo message)
 	bool txn_read_only;
 	StringInfoData buf;
 	GTM_TransactionHandle txn;
+	GTM_Timestamp timestamp;
 	MemoryContext oldContext;

 	txn_isolation_level = pq_getmsgint(message, sizeof (GTM_IsolationLevel));
@@ -860,6 +863,9 @@ ProcessBeginTransactionCommand(Port *myport, StringInfo message)

 	MemoryContextSwitchTo(oldContext);

+	/* GXID has been received, now it's time to get a GTM timestamp */
+	timestamp = GTM_TimestampGetCurrent();
+
 	pq_beginmessage(&buf, 'S');
 	pq_sendint(&buf, TXN_BEGIN_RESULT, 4);
 	if (myport->is_proxy)
@@ -869,6 +875,7 @@ ProcessBeginTransactionCommand(Port *myport, StringInfo message)
 		pq_sendbytes(&buf, (char *)&proxyhdr, sizeof (GTM_ProxyMsgHeader));
 	}
 	pq_sendbytes(&buf, (char *)&txn, sizeof(txn));
+	pq_sendbytes(&buf, (char *)&timestamp, sizeof (GTM_Timestamp));
 	pq_endmessage(myport, &buf);

 	if (!myport->is_proxy)
@@ -1003,6 +1010,7 @@ ProcessBeginTransactionGetGXIDCommandMulti(Port *myport, StringInfo message)
 	StringInfoData buf;
 	GTM_TransactionHandle txn[GTM_MAX_GLOBAL_TRANSACTIONS];
 	GlobalTransactionId gxid, end_gxid;
+	GTM_Timestamp timestamp;
 	GTMProxy_ConnID txn_connid[GTM_MAX_GLOBAL_TRANSACTIONS];
 	MemoryContext oldContext;
 	int count;
@@ -1042,6 +1050,9 @@ ProcessBeginTransactionGetGXIDCommandMulti(Port *myport, StringInfo message)

 	MemoryContextSwitchTo(oldContext);

+	/* GXID has been received, now it's time to get a GTM timestamp */
+	timestamp = GTM_TimestampGetCurrent();
+
 	end_gxid = gxid + txn_count;
 	if (end_gxid < gxid)
 		end_gxid += FirstNormalGlobalTransactionId;
@@ -1058,6 +1069,7 @@ ProcessBeginTransactionGetGXIDCommandMulti(Port *myport, StringInfo message)
 	}
 	pq_sendbytes(&buf, (char *)&txn_count, sizeof(txn_count));
 	pq_sendbytes(&buf, (char *)&gxid, sizeof(gxid));
+	pq_sendbytes(&buf, (char *)&(timestamp), sizeof (GTM_Timestamp));
 	pq_endmessage(myport, &buf);

 	if (!myport->is_proxy)
diff --git a/src/gtm/proxy/proxy_main.c b/src/gtm/proxy/proxy_main.c
index f5f6e65..66b1594 100644
--- a/src/gtm/proxy/proxy_main.c
+++ b/src/gtm/proxy/proxy_main.c
@@ -988,6 +988,7 @@ ProcessResponse(GTMProxy_ThreadInfo *thrinfo, GTMProxy_CommandInfo *cmdinfo,
 {
 	StringInfoData buf;
 	GlobalTransactionId gxid;
+	GTM_Timestamp timestamp;

 	switch (cmdinfo->ci_mtype)
 	{
@@ -1011,9 +1012,13 @@ ProcessResponse(GTMProxy_ThreadInfo *thrinfo, GTMProxy_CommandInfo *cmdinfo,
 			if (gxid < res->gr_resdata.grd_txn_get_multi.start_gxid)
 				gxid += FirstNormalGlobalTransactionId;

+			/* Send back to each client the same timestamp value asked in this message */
+			timestamp = res->gr_resdata.grd_txn_get_multi.timestamp;
+
 			pq_beginmessage(&buf, 'S');
 			pq_sendint(&buf, TXN_BEGIN_GETGXID_RESULT, 4);
 			pq_sendbytes(&buf, (char *)&gxid, sizeof (GlobalTransactionId));
+			pq_sendbytes(&buf, (char *)&timestamp, sizeof (GTM_Timestamp));
 			pq_endmessage(cmdinfo->ci_conn->con_port, &buf);
 			pq_flush(cmdinfo->ci_conn->con_port);
 		}
diff --git a/src/include/access/gtm.h b/src/include/access/gtm.h
index 3831f09..4878d92 100644
--- a/src/include/access/gtm.h
+++ b/src/include/access/gtm.h
@@ -20,7 +20,7 @@ extern int GtmCoordinatorId;
 extern bool IsGTMConnected(void);
 extern void InitGTM(void);
 extern void CloseGTM(void);
-extern GlobalTransactionId BeginTranGTM(void);
+extern GlobalTransactionId BeginTranGTM(GTM_Timestamp *timestamp);
 extern GlobalTransactionId BeginTranAutovacuumGTM(void);
 extern int CommitTranGTM(GlobalTransactionId gxid);
 extern int RollbackTranGTM(GlobalTransactionId gxid);
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 6c2a5b8..d7c7b7b 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -16,7 +16,9 @@
 #define TRANSAM_H

 #include "access/xlogdefs.h"
-
+#ifdef PGXC
+#include "gtm/gtm_c.h"
+#endif

 /* ----------------
  *		Special transaction ID values
@@ -157,8 +159,10 @@ extern XLogRecPtr TransactionIdGetCommitLSN(TransactionId xid);
 extern void SetNextTransactionId(TransactionId xid);
 extern void SetForceXidFromGTM(bool value);
 extern bool GetForceXidFromGTM(void);
-#endif /* PGXC */
+extern TransactionId GetNewTransactionId(bool isSubXact, bool *timestamp_received, GTM_Timestamp *timestamp);
+#else
 extern TransactionId GetNewTransactionId(bool isSubXact);
+#endif /* PGXC */
 extern TransactionId ReadNewTransactionId(void);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 					  Name oldest_datname);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5bd157b..01fb498 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -157,6 +157,10 @@ extern TimestampTz GetCurrentTransactionStartTimestamp(void);
 extern TimestampTz GetCurrentStatementStartTimestamp(void);
 extern TimestampTz GetCurrentTransactionStopTimestamp(void);
 extern void SetCurrentStatementStartTimestamp(void);
+#ifdef PGXC
+extern TimestampTz GetCurrentGTMStartTimestamp(void);
+extern void SetCurrentGTMDeltaTimestamp(TimestampTz timestamp);
+#endif
 extern int	GetCurrentTransactionNestLevel(void);
 extern bool TransactionIdIsCurrentTransactionId(TransactionId xid);
 extern void CommandCounterIncrement(void);
diff --git a/src/include/gtm/gtm_c.h b/src/include/gtm/gtm_c.h
index 1a04064..0a4c941 100644
--- a/src/include/gtm/gtm_c.h
+++ b/src/include/gtm/gtm_c.h
@@ -55,6 +55,12 @@ typedef int32	GTM_TransactionHandle;

 #define InvalidTransactionHandle	-1

+/*
+ * As GTM and Postgres-XC packages are separated, GTM and XC's API
+ * use different type names for timestamps and sequences, but they have to be the same!
+ */
+typedef int64	GTM_Timestamp;	/* timestamp data is 64-bit based */
+
 typedef int64	GTM_Sequence;	/* a 64-bit sequence */

 typedef struct GTM_SequenceKeyData
 {
diff --git a/src/include/gtm/gtm_client.h b/src/include/gtm/gtm_client.h
index 05e44bf..9db6884 100644
--- a/src/include/gtm/gtm_client.h
+++ b/src/include/gtm/gtm_client.h
@@ -21,8 +21,14 @@
 typedef union GTM_ResultData
 {
 	GTM_TransactionHandle		grd_txnhandle;	/* TXN_BEGIN */
-	GlobalTransactionId		grd_gxid;	/* TXN_BEGIN_GETGXID
-							 * TXN_PREPARE
+
+	struct
+	{
+		GlobalTransactionId	gxid;
+		GTM_Timestamp		timestamp;
+	} grd_gxid_tp;					/* TXN_BEGIN_GETGXID */
+
+	GlobalTransactionId		grd_gxid;	/* TXN_PREPARE
 							 * TXN_COMMIT
 							 * TXN_ROLLBACK
 							 */
@@ -47,6 +53,7 @@ typedef union GTM_ResultData
 	{
 		int			txn_count;	/* TXN_BEGIN_GETGXID_MULTI */
 		GlobalTransactionId	start_gxid;
+		GTM_Timestamp		timestamp;
 	} grd_txn_get_multi;

 	struct
@@ -101,7 +108,7 @@ void disconnect_gtm(GTM_Conn *conn);
 /*
  * Transaction Management API
  */
-GlobalTransactionId begin_transaction(GTM_Conn *conn, GTM_IsolationLevel isolevel);
+GlobalTransactionId begin_transaction(GTM_Conn *conn, GTM_IsolationLevel isolevel, GTM_Timestamp *timestamp);
 GlobalTransactionId begin_transaction_autovacuum(GTM_Conn *conn, GTM_IsolationLevel isolevel);
 int commit_transaction(GTM_Conn *conn, GlobalTransactionId gxid);
 int abort_transaction(GTM_Conn *conn, GlobalTransactionId gxid);
diff --git a/src/include/gtm/gtm_time.h b/src/include/gtm/gtm_time.h
new file mode 100644
index 0000000..b3d7005
--- /dev/null
+++ b/src/include/gtm/gtm_time.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * gtm_time.h
+ *
+ *
+ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * Portions Copyright (c) 2010 Nippon Telegraph and Telephone Corporation
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef GTM_TIME_H
+#define GTM_TIME_H
+
+/* Julian-date equivalents of Day 0 in Unix and GTM reckoning */
+#define UNIX_EPOCH_JDATE	2440588		/* == date2j(1970, 1, 1) */
+#define GTM_EPOCH_JDATE		2451545		/* == date2j(2000, 1, 1) */
+
+#define SECS_PER_YEAR	(36525 * 864)	/* avoid floating-point computation */
+#define SECS_PER_DAY	86400
+#define SECS_PER_HOUR	3600
+#define SECS_PER_MINUTE	60
+#define MINS_PER_HOUR	60
+
+#ifdef HAVE_INT64_TIMESTAMP
+#define USECS_PER_DAY		INT64CONST(86400000000)
+#define USECS_PER_HOUR		INT64CONST(3600000000)
+#define USECS_PER_MINUTE	INT64CONST(60000000)
+#define USECS_PER_SEC		INT64CONST(1000000)
+#endif
+
+GTM_Timestamp GTM_TimestampGetCurrent(void);
+
+#endif
diff --git a/src/include/pgxc/datanode.h b/src/include/pgxc/datanode.h
index 849d84a..4202e2e 100644
--- a/src/include/pgxc/datanode.h
+++ b/src/include/pgxc/datanode.h
@@ -18,6 +18,7 @@
 #define DATANODE_H
 #include "postgres.h"
 #include "gtm/gtm_c.h"
+#include "utils/timestamp.h"
 #include "nodes/pg_list.h"
 #include "utils/snapshot.h"
 #include <unistd.h>
@@ -88,6 +89,7 @@ extern int	ensure_out_buffer_capacity(size_t bytes_needed, DataNodeHandle * hand
 extern int	data_node_send_query(DataNodeHandle * handle, const char *query);
 extern int	data_node_send_gxid(DataNodeHandle * handle, GlobalTransactionId gxid);
 extern int	data_node_send_snapshot(DataNodeHandle * handle, Snapshot snapshot);
+extern int	data_node_send_timestamp(DataNodeHandle * handle, TimestampTz timestamp);
 extern int	data_node_receive(const int conn_count, DataNodeHandle ** connections,
 				  struct timeval * timeout);
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 906ceb6..801e89b 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -23,6 +23,10 @@
 #include "utils/int8.h"
 #endif

+#ifdef PGXC
+#include "pgxc/pgxc.h"
+#endif
+
 /*
  * Timestamp represents absolute time.
  *
@@ -45,6 +49,11 @@

 #ifdef HAVE_INT64_TIMESTAMP

+/*
+ * PGXC note: GTM and Postgres-XC packages have to be separated.
+ * Both use different type names for timestamp, but those types have to be the same!
+ */
+
 typedef int64 Timestamp;
 typedef int64 TimestampTz;
 typedef int64 TimeOffset;
@@ -190,6 +199,10 @@ typedef struct
 #define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) / 1000.0))
 #endif

+#ifdef PGXC
+#define InvalidGlobalTimestamp	((TimestampTz) 0)
+#define GlobalTimestampIsValid(timestamp)	((TimestampTz) (timestamp)) != InvalidGlobalTimestamp
+#endif

 /* Set at postmaster start */
 extern TimestampTz PgStartTime;
-----------------------------------------------------------------------

Summary of changes:
 src/backend/access/transam/gtm.c    |    6 +-
 src/backend/access/transam/varsup.c |   22 +++++--
 src/backend/access/transam/xact.c   |  117 ++++++++++++++++++++++++++++++++++-
 src/backend/pgxc/pool/datanode.c    |   42 +++++++++++++
 src/backend/pgxc/pool/execRemote.c  |   58 ++++++++++++++++-
 src/backend/tcop/postgres.c         |   18 +++++-
 src/backend/utils/adt/timestamp.c   |    4 +
 src/gtm/client/fe-protocol.c        |   16 ++++-
 src/gtm/client/gtm_client.c         |    9 ++-
 src/gtm/main/Makefile               |    2 +-
 src/gtm/main/gtm_time.c             |   41 ++++++++++++
 src/gtm/main/gtm_txn.c              |   12 ++++
 src/gtm/proxy/proxy_main.c          |    5 ++
 src/include/access/gtm.h            |    2 +-
 src/include/access/transam.h        |    8 ++-
 src/include/access/xact.h           |    4 +
 src/include/gtm/gtm_c.h             |    6 ++
 src/include/gtm/gtm_client.h        |   13 +++-
 src/include/gtm/gtm_time.h          |   37 +++++++++++
 src/include/pgxc/datanode.h         |    2 +
 src/include/utils/timestamp.h       |   13 ++++
 21 files changed, 410 insertions(+), 27 deletions(-)
 create mode 100644 src/gtm/main/gtm_time.c
 create mode 100644 src/include/gtm/gtm_time.h

hooks/post-receive
--
Postgres-XC
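
Stripped of the surrounding plumbing, the delta mechanism in this commit is small: when the GXID arrives with a GTM timestamp, the node records delta = GTM time minus local clock, and from then on shifts every locally sampled clock value by that delta before reporting it. A self-contained sketch under those assumptions; the names below are illustrative stand-ins, not the actual xact.c symbols (GTMxactStartTimestamp, GTMdeltaTimestamp):

    #include <stdio.h>
    #include <stdint.h>

    typedef int64_t TimestampTz;    /* microseconds, as with HAVE_INT64_TIMESTAMP */

    static TimestampTz gtm_delta = 0;   /* per-node offset against the GTM timeline */

    /* Called once when the GTM timestamp arrives with the GXID ('t' message). */
    static void
    set_gtm_delta(TimestampTz gtm_ts, TimestampTz local_now)
    {
        gtm_delta = gtm_ts - local_now;
    }

    /* Any locally sampled clock value is shifted onto the GTM timeline. */
    static TimestampTz
    to_gtm_timeline(TimestampTz local_ts)
    {
        return local_ts + gtm_delta;
    }

    int
    main(void)
    {
        /* Node clock is 150us behind GTM; the delta corrects reported values. */
        set_gtm_delta(1000150, 1000000);
        printf("%lld\n", (long long) to_gtm_timeline(1000500));    /* 1000650 */
        return 0;
    }

Because every node derives its delta from the same GTM value, now() and its relatives agree across the cluster even when the local clocks do not.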
From: mason_s <ma...@us...> - 2010-08-23 06:25:06

Project "Postgres-XC". The branch, master has been updated
       via  ba6f32f142cf8731ba29e5495e0f97f3b0455da0 (commit)
      from  d97c52965478dafe7f5f2ccabc588c6279c117e7 (commit)

- Log -----------------------------------------------------------------
commit ba6f32f142cf8731ba29e5495e0f97f3b0455da0
Author: Mason Sharp <ma...@us...>
Date:   Mon Aug 23 15:22:28 2010 +0900

    Fix a visibility warning due to not taking into account transactions
    that are running globally across all nodes in the cluster.

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index be24657..5f97320 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -625,6 +625,11 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
 	TransactionId result;
 	int			index;

+#ifdef PGXC
+	if (TransactionIdIsValid(RecentGlobalXmin))
+		return RecentGlobalXmin;
+#endif
+
 	LWLockAcquire(ProcArrayLock, LW_SHARED);

 	/*
-----------------------------------------------------------------------

Summary of changes:
 src/backend/storage/ipc/procarray.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

hooks/post-receive
--
Postgres-XC
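
The reasoning behind the one-line change: the oldest xmin computed from local backends alone can be newer than a transaction still running on another node, which would let vacuum remove row versions a global snapshot still needs; preferring the GTM-derived RecentGlobalXmin keeps the horizon cluster-safe. A schematic comparison with made-up values (get_oldest_xmin() here is a toy, not the procarray.c function):

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)

    /* GTM-provided cluster-wide horizon; 0 until the first snapshot arrives. */
    static TransactionId RecentGlobalXmin = 0;

    static TransactionId
    get_oldest_xmin(const TransactionId *local_xmins, int n)
    {
        /* Mirror of the commit: trust the global horizon when we have one. */
        if (RecentGlobalXmin != InvalidTransactionId)
            return RecentGlobalXmin;

        TransactionId result = (TransactionId) -1;
        for (int i = 0; i < n; i++)
            if (local_xmins[i] < result)
                result = local_xmins[i];
        return result;
    }

    int
    main(void)
    {
        TransactionId local[] = {120, 130};   /* this node's backends */

        RecentGlobalXmin = 100;               /* a txn still open elsewhere */
        /* The local minimum would be 120; the cluster-safe answer is 100. */
        printf("%u\n", get_oldest_xmin(local, 2));
        return 0;
    }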
From: mason_s <ma...@us...> - 2010-08-23 04:27:42

Project "Postgres-XC". The branch, master has been updated
       via  d97c52965478dafe7f5f2ccabc588c6279c117e7 (commit)
      from  b6602543d5dd6dfa4005db41c73d0136f74af13e (commit)

- Log -----------------------------------------------------------------
commit d97c52965478dafe7f5f2ccabc588c6279c117e7
Author: Mason Sharp <ma...@us...>
Date:   Mon Aug 23 13:24:57 2010 +0900

    In Postgres-XC, when extending the clog, the status assertion
    occasionally fails under very heavy load during long tests. We break
    the two assertions out and make the second one a warning instead of
    an assertion for now.

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 68e3869..919e146 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -555,8 +555,22 @@ SimpleLruWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)

 	/* Re-acquire control lock and update page state */
 	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+#ifdef PGXC
+	/*
+	 * In Postgres-XC the status assertion occasionally fails when
+	 * under a very heavy for long tests.
+	 * We break the two assertions out and make the second one
+	 * this a warning instead of an assertion for now.
+	 */
+	Assert(shared->page_number[slotno] == pageno);
+
+	if (shared->page_status[slotno] != SLRU_PAGE_WRITE_IN_PROGRESS)
+		elog(WARNING, "Unexpected page status in SimpleLruWritePage(), status = %d, was expecting 3 (SLRU_PAGE_WRITE_IN_PROGRESS) for page %d",
+			 shared->page_status[slotno], shared->page_number[slotno]);
+#else
 	Assert(shared->page_number[slotno] == pageno &&
 		   shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS);
+#endif

 	/* If we failed to write, mark the page dirty again */
 	if (!ok)
-----------------------------------------------------------------------

Summary of changes:
 src/backend/access/transam/slru.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

hooks/post-receive
--
Postgres-XC
From: mason_s <ma...@us...> - 2010-08-23 04:14:09
|
Project "Postgres-XC". The branch, master has been updated via b6602543d5dd6dfa4005db41c73d0136f74af13e (commit) from cfb29183b57e811e0dfcf3641c5cc58458b2584a (commit) - Log ----------------------------------------------------------------- commit b6602543d5dd6dfa4005db41c73d0136f74af13e Author: M S <masonsharp@5.105.180.203.e.iijmobile.jp> Date: Mon Aug 23 13:11:31 2010 +0900 Initial support for multi-step queries, including cross-node joins. Note that this is a "version 1.0" implementation, borrowing some code from the SQL/MED patch. This means that all cross-node joins take place on a Coordinator by pulling up data from the data nodes. Some queries will therefore execute quite slowly, but they will at least execute. In this patch, all columns are SELECTed from the remote table, but at least simple WHERE clauses are pushed down to the remote nodes. We will optimize query processing in the future. Note that the same connections to remote nodes are used in multiple steps. To get around that problem, we just add a materialization node above each RemoteQuery node, and force all results to be fetched first on the Coordinator. This patch also allows UNION, EXCEPT and INTERSECT, and other more complex SELECT statements to run now. It includes a fix for single-step, multi-node LIMIT and OFFSET. It also includes EXPLAIN output from the Coordinator's point of view. Adding these changes introduced a problem with AVG(), which is currently not working. diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 950a2f1..aa92917 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -565,6 +565,11 @@ explain_outNode(StringInfo str, case T_WorkTableScan: pname = "WorkTable Scan"; break; +#ifdef PGXC + case T_RemoteQuery: + pname = "Data Node Scan"; + break; +#endif case T_Material: pname = "Materialize"; break; @@ -668,6 +673,9 @@ explain_outNode(StringInfo str, case T_SeqScan: case T_BitmapHeapScan: case T_TidScan: +#ifdef PGXC + case T_RemoteQuery: +#endif if (((Scan *) plan)->scanrelid > 0) { RangeTblEntry *rte = rt_fetch(((Scan *) plan)->scanrelid, @@ -686,6 +694,26 @@ explain_outNode(StringInfo str, appendStringInfo(str, " %s", quote_identifier(rte->eref->aliasname)); } +#ifdef PGXC + if (IsA(plan, RemoteQuery)) + { + RemoteQuery *remote_query = (RemoteQuery *) plan; + + /* if it is a single-step plan, print out the sql being used */ + if (remote_query->sql_statement) + { + char *realsql = NULL; + realsql = strcasestr(remote_query->sql_statement, "explain"); + if (!realsql) + realsql = remote_query->sql_statement; + else + realsql += 8; /* skip "EXPLAIN" */ + + appendStringInfo(str, " %s", + quote_identifier(realsql)); + } + } +#endif break; case T_BitmapIndexScan: appendStringInfo(str, " on %s", @@ -854,6 +882,9 @@ explain_outNode(StringInfo str, case T_ValuesScan: case T_CteScan: case T_WorkTableScan: +#ifdef PGXC + case T_RemoteQuery: +#endif show_scan_qual(plan->qual, "Filter", ((Scan *) plan)->scanrelid, diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 22fd416..c8b1456 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -2925,6 +2925,20 @@ ATRewriteTables(List **wqueue) } } +#ifdef PGXC + /* + * In PGXC, do not check the FK constraints on the Coordinator, and just return + * That is because a SELECT is generated whose plan will try and use + * the data nodes. 
We (currently) do not want to do that on the Coordinator, + * when the command is passed down to the data nodes it will + * peform the check locally. + * This issue was introduced when we added multi-step handling, + * it caused foreign key constraints to fail. + * PGXCTODO - issue for pg_catalog or any other cases? + */ + if (IS_PGXC_COORDINATOR) + return; +#endif /* * Foreign key constraints are checked in a final pass, since (a) it's * generally best to examine each one separately, and (b) it's at least diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c index 25f350f..01e2548 100644 --- a/src/backend/executor/execAmi.c +++ b/src/backend/executor/execAmi.c @@ -44,6 +44,9 @@ #include "executor/nodeWindowAgg.h" #include "executor/nodeWorktablescan.h" #include "nodes/nodeFuncs.h" +#ifdef PGXC +#include "pgxc/execRemote.h" +#endif #include "utils/syscache.h" @@ -183,6 +186,11 @@ ExecReScan(PlanState *node, ExprContext *exprCtxt) ExecWorkTableScanReScan((WorkTableScanState *) node, exprCtxt); break; +#ifdef PGXC + case T_RemoteQueryState: + ExecRemoteQueryReScan((RemoteQueryState *) node, exprCtxt); + break; +#endif case T_NestLoopState: ExecReScanNestLoop((NestLoopState *) node, exprCtxt); break; diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c index 446b400..2cd3298 100644 --- a/src/backend/executor/nodeMaterial.c +++ b/src/backend/executor/nodeMaterial.c @@ -24,6 +24,9 @@ #include "executor/executor.h" #include "executor/nodeMaterial.h" #include "miscadmin.h" +#ifdef PGXC +#include "pgxc/pgxc.h" +#endif /* ---------------------------------------------------------------- * ExecMaterial @@ -56,9 +59,24 @@ ExecMaterial(MaterialState *node) /* * If first time through, and we need a tuplestore, initialize it. */ +#ifdef PGXC + /* + * For PGXC, temporarily always create the storage. + * This allows us to easily use the same connection to + * in multiple steps of the plan. + */ + if ((IS_PGXC_COORDINATOR && tuplestorestate == NULL) + || (IS_PGXC_DATANODE && tuplestorestate == NULL && node->eflags != 0)) +#else if (tuplestorestate == NULL && node->eflags != 0) +#endif { tuplestorestate = tuplestore_begin_heap(true, false, work_mem); +#ifdef PGXC + if (IS_PGXC_COORDINATOR) + /* Note that we will rescan these results */ + node->eflags |= EXEC_FLAG_REWIND; +#endif tuplestore_set_eflags(tuplestorestate, node->eflags); if (node->eflags & EXEC_FLAG_MARK) { @@ -73,6 +91,26 @@ ExecMaterial(MaterialState *node) Assert(ptrno == 1); } node->tuplestorestate = tuplestorestate; + +#ifdef PGXC + if (IS_PGXC_COORDINATOR) + { + TupleTableSlot *outerslot; + PlanState *outerNode = outerPlanState(node); + + /* We want to always materialize first temporarily in PG-XC */ + while (!node->eof_underlying) + { + outerslot = ExecProcNode(outerNode); + if (TupIsNull(outerslot)) + node->eof_underlying = true; + else + /* Append a copy of the returned tuple to tuplestore. 
*/ + tuplestore_puttupleslot(tuplestorestate, outerslot); + } + tuplestore_rescan(node->tuplestorestate); + } +#endif } /* diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index a1aa660..0e9aa43 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -17,6 +17,7 @@ #include <math.h> +#include "catalog/pg_namespace.h" #include "nodes/nodeFuncs.h" #ifdef OPTIMIZER_DEBUG #include "nodes/print.h" @@ -33,7 +34,11 @@ #include "optimizer/var.h" #include "parser/parse_clause.h" #include "parser/parsetree.h" +#ifdef PGXC +#include "pgxc/pgxc.h" +#endif #include "rewrite/rewriteManip.h" +#include "utils/lsyscache.h" /* These parameters are set by GUC */ @@ -254,6 +259,18 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte) * least one dimension of cost or sortedness. */ +#ifdef PGXC + /* + * If we are on the coordinator, we always want to use + * the remote query path unless it is a pg_catalog table. + */ + if (IS_PGXC_COORDINATOR + && get_rel_namespace(rte->relid) != PG_CATALOG_NAMESPACE) + add_path(rel, create_remotequery_path(root, rel)); + else + { +#endif + /* Consider sequential scan */ add_path(rel, create_seqscan_path(root, rel)); @@ -262,6 +279,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte) /* Consider TID scans */ create_tidscan_paths(root, rel); +#ifdef PGXC + } +#endif /* Now find the cheapest of the paths for this rel */ set_cheapest(rel); diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 6e5c251..337f17b 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -32,6 +32,9 @@ #include "optimizer/var.h" #include "parser/parse_clause.h" #include "parser/parsetree.h" +#ifdef PGXC +#include "pgxc/planner.h" +#endif #include "utils/lsyscache.h" @@ -66,6 +69,10 @@ static CteScan *create_ctescan_plan(PlannerInfo *root, Path *best_path, List *tlist, List *scan_clauses); static WorkTableScan *create_worktablescan_plan(PlannerInfo *root, Path *best_path, List *tlist, List *scan_clauses); +#ifdef PGXC +static RemoteQuery *create_remotequery_plan(PlannerInfo *root, Path *best_path, + List *tlist, List *scan_clauses); +#endif static NestLoop *create_nestloop_plan(PlannerInfo *root, NestPath *best_path, Plan *outer_plan, Plan *inner_plan); static MergeJoin *create_mergejoin_plan(PlannerInfo *root, MergePath *best_path, @@ -101,6 +108,10 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual, Index scanrelid, int ctePlanId, int cteParam); static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual, Index scanrelid, int wtParam); +#ifdef PGXC +static RemoteQuery *make_remotequery(List *qptlist, RangeTblEntry *rte, + List *qpqual, Index scanrelid); +#endif static BitmapAnd *make_bitmap_and(List *bitmapplans); static BitmapOr *make_bitmap_or(List *bitmapplans); static NestLoop *make_nestloop(List *tlist, @@ -162,6 +173,9 @@ create_plan(PlannerInfo *root, Path *best_path) case T_ValuesScan: case T_CteScan: case T_WorkTableScan: +#ifdef PGXC + case T_RemoteQuery: +#endif plan = create_scan_plan(root, best_path); break; case T_HashJoin: @@ -207,6 +221,9 @@ create_scan_plan(PlannerInfo *root, Path *best_path) List *tlist; List *scan_clauses; Plan *plan; +#ifdef PGXC + Plan *matplan; +#endif /* * For table scans, rather than using the relation targetlist (which is @@ -298,6 +315,23 @@ create_scan_plan(PlannerInfo *root, Path *best_path) scan_clauses); 
break; +#ifdef PGXC + case T_RemoteQuery: + plan = (Plan *) create_remotequery_plan(root, + best_path, + tlist, + scan_clauses); + + /* + * Insert a materialization plan above this temporarily + * until we better handle multiple steps using the same connection. + */ + matplan = (Plan *) make_material(plan); + copy_plan_costsize(matplan, plan); + matplan->total_cost += cpu_tuple_cost * matplan->plan_rows; + plan = matplan; + break; +#endif default: elog(ERROR, "unrecognized node type: %d", (int) best_path->pathtype); @@ -420,6 +454,9 @@ disuse_physical_tlist(Plan *plan, Path *path) case T_ValuesScan: case T_CteScan: case T_WorkTableScan: +#ifdef PGXC + case T_RemoteQuery: +#endif plan->targetlist = build_relation_tlist(path->parent); break; default: @@ -1544,6 +1581,46 @@ create_worktablescan_plan(PlannerInfo *root, Path *best_path, return scan_plan; } +#ifdef PGXC +/* + * create_remotequery_plan + * Returns a remotequery plan for the base relation scanned by 'best_path' + * with restriction clauses 'scan_clauses' and targetlist 'tlist'. + */ +static RemoteQuery * +create_remotequery_plan(PlannerInfo *root, Path *best_path, + List *tlist, List *scan_clauses) +{ + RemoteQuery *scan_plan; + Index scan_relid = best_path->parent->relid; + RangeTblEntry *rte; + + + Assert(scan_relid > 0); + rte = planner_rt_fetch(scan_relid, root); + Assert(best_path->parent->rtekind == RTE_RELATION); + Assert(rte->rtekind == RTE_RELATION); + + /* Sort clauses into best execution order */ + scan_clauses = order_qual_clauses(root, scan_clauses); + + /* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */ + scan_clauses = extract_actual_clauses(scan_clauses, false); + + scan_plan = make_remotequery(tlist, + rte, + scan_clauses, + scan_relid); + + copy_path_costsize(&scan_plan->scan.plan, best_path); + + /* PGXCTODO - get better estimates */ + scan_plan->scan.plan.plan_rows = 1000; + + return scan_plan; +} +#endif + /***************************************************************************** * @@ -2615,6 +2692,28 @@ make_worktablescan(List *qptlist, return node; } +#ifdef PGXC +static RemoteQuery * +make_remotequery(List *qptlist, + RangeTblEntry *rte, + List *qpqual, + Index scanrelid) +{ + RemoteQuery *node = makeNode(RemoteQuery); + Plan *plan = &node->scan.plan; + + /* cost should be inserted by caller */ + plan->targetlist = qptlist; + plan->qual = qpqual; + plan->lefttree = NULL; + plan->righttree = NULL; + node->scan.scanrelid = scanrelid; + node->read_only = true; + + return node; +} +#endif + Append * make_append(List *appendplans, bool isTarget, List *tlist) { diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c index d839c28..cab7fb4 100644 --- a/src/backend/optimizer/plan/setrefs.c +++ b/src/backend/optimizer/plan/setrefs.c @@ -22,6 +22,9 @@ #include "optimizer/clauses.h" #include "optimizer/planmain.h" #include "optimizer/tlist.h" +#ifdef PGXC +#include "pgxc/planner.h" +#endif #include "parser/parsetree.h" #include "utils/lsyscache.h" #include "utils/syscache.h" @@ -373,6 +376,19 @@ set_plan_refs(PlannerGlobal *glob, Plan *plan, int rtoffset) fix_scan_list(glob, splan->scan.plan.qual, rtoffset); } break; +#ifdef PGXC + case T_RemoteQuery: + { + RemoteQuery *splan = (RemoteQuery *) plan; + + splan->scan.scanrelid += rtoffset; + splan->scan.plan.targetlist = + fix_scan_list(glob, splan->scan.plan.targetlist, rtoffset); + splan->scan.plan.qual = + fix_scan_list(glob, splan->scan.plan.qual, rtoffset); + } + break; +#endif case T_NestLoop: case 
T_MergeJoin: case T_HashJoin: diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c index bde351d..e5c6dac 100644 --- a/src/backend/optimizer/plan/subselect.c +++ b/src/backend/optimizer/plan/subselect.c @@ -1963,6 +1963,12 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params) bms_add_member(context.paramids, ((WorkTableScan *) plan)->wtParam); break; +#ifdef PGXC + case T_RemoteQuery: + //PGXCTODO + context.paramids = bms_add_members(context.paramids, valid_params); + break; +#endif case T_Append: { diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index b7c3d3c..cbf7618 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -1310,6 +1310,28 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel) return pathnode; } +#ifdef PGXC +/* + * create_remotequery_path + * Creates a path corresponding to a scan of a remote query, + * returning the pathnode. + */ +Path * +create_remotequery_path(PlannerInfo *root, RelOptInfo *rel) +{ + Path *pathnode = makeNode(Path); + + pathnode->pathtype = T_RemoteQuery; + pathnode->parent = rel; + pathnode->pathkeys = NIL; /* result is always unordered */ + + // PGXCTODO - set cost properly + cost_seqscan(pathnode, root, rel); + + return pathnode; +} +#endif + /* * create_nestloop_path * Creates a pathnode corresponding to a nestloop join between two diff --git a/src/backend/pgxc/plan/planner.c b/src/backend/pgxc/plan/planner.c index 1dcfc29..c8911b7 100644 --- a/src/backend/pgxc/plan/planner.c +++ b/src/backend/pgxc/plan/planner.c @@ -25,6 +25,7 @@ #include "nodes/nodes.h" #include "nodes/parsenodes.h" #include "optimizer/clauses.h" +#include "optimizer/planmain.h" #include "optimizer/planner.h" #include "optimizer/tlist.h" #include "parser/parse_agg.h" @@ -141,7 +142,7 @@ bool StrictSelectChecking = false; static Exec_Nodes *get_plan_nodes(Query *query, bool isRead); static bool get_plan_nodes_walker(Node *query_node, XCWalkerContext *context); static bool examine_conditions_walker(Node *expr_node, XCWalkerContext *context); - +static int handle_limit_offset(RemoteQuery *query_step, Query *query, PlannedStmt *plan_stmt); /* * True if both lists contain only one node and are the same @@ -1528,16 +1529,6 @@ get_simple_aggregates(Query * query) simple_agg_list = lappend(simple_agg_list, simple_agg); } - else - { - /* - * PGXCTODO relax this limit after adding GROUP BY support - * then support expressions of aggregates - */ - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("Query is not yet supported")))); - } column_pos++; } } @@ -1629,7 +1620,7 @@ reconstruct_step_query(List *rtable, bool has_order_by, List *extra_sort, { List *context; bool useprefix; - List *sub_tlist = step->plan.targetlist; + List *sub_tlist = step->scan.plan.targetlist; ListCell *l; StringInfo buf = makeStringInfo(); char *sql; @@ -1737,7 +1728,7 @@ make_simple_sort_from_sortclauses(Query *query, RemoteQuery *step) { List *sortcls = query->sortClause; List *distinctcls = query->distinctClause; - List *sub_tlist = step->plan.targetlist; + List *sub_tlist = step->scan.plan.targetlist; SimpleSort *sort; SimpleDistinct *distinct; ListCell *l; @@ -1978,6 +1969,100 @@ make_simple_sort_from_sortclauses(Query *query, RemoteQuery *step) } /* + * Special case optimization. + * Handle LIMIT and OFFSET for single-step queries on multiple nodes. + * + * Return non-zero if we need to fall back to the standard plan. 
+ */ +static int +handle_limit_offset(RemoteQuery *query_step, Query *query, PlannedStmt *plan_stmt) +{ + + /* check if no special handling needed */ + if (query_step && query_step->exec_nodes && + list_length(query_step->exec_nodes->nodelist) <= 1) + return 0; + + /* if order by and limit are present, do not optimize yet */ + if ((query->limitCount || query->limitOffset) && query->sortClause) + return 1; + + /* + * Note that query_step->is_single_step is set to true, but + * it is ok even if we add limit here. + * If OFFSET is set, we strip the final offset value and add + * it to the LIMIT passed down. If there is an OFFSET and no + * LIMIT, we just strip off OFFSET. + */ + if (query->limitOffset) + { + int64 newLimit = 0; + char *newpos; + char *pos; + char *limitpos = NULL; /* stays NULL when there is no LIMIT clause */ + char *newQuery; + char *newchar; + char *c; + + pos = NULL; + newpos = NULL; + + if (query->limitCount) + { + for (pos = query_step->sql_statement, newpos = pos; newpos != NULL; ) + { + pos = newpos; + newpos = strcasestr(pos+1, "LIMIT"); + } + limitpos = pos; + + if (IsA(query->limitCount, Const)) + newLimit = DatumGetInt64(((Const *) query->limitCount)->constvalue); + else + return 1; + } + + for (pos = query_step->sql_statement, newpos = pos; newpos != NULL; ) + { + pos = newpos; + newpos = strcasestr(pos+1, "OFFSET"); + } + + if (limitpos && limitpos < pos) + pos = limitpos; + + if (IsA(query->limitOffset, Const)) + newLimit += DatumGetInt64(((Const *) query->limitOffset)->constvalue); + else + return 1; + + if (!pos || pos == query_step->sql_statement) + elog(ERROR, "Could not handle LIMIT/OFFSET"); + + newQuery = (char *) palloc(strlen(query_step->sql_statement)+1); + newchar = newQuery; + + /* copy up until position where we found clause */ + for (c = &query_step->sql_statement[0]; c != pos && *c != '\0'; *newchar++ = *c++); + + if (query->limitCount) + sprintf(newchar, "LIMIT " INT64_FORMAT, newLimit); + else + *newchar = '\0'; + + pfree(query_step->sql_statement); + query_step->sql_statement = newQuery; + } + + /* Now add a limit execution node at the top of the plan */ + plan_stmt->planTree = (Plan *) make_limit(plan_stmt->planTree, + query->limitOffset, query->limitCount, 0, 0); + + return 0; +} + + +/* + * Build up a QueryPlan to execute on. 
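The scan loop in handle_limit_offset() keeps calling strcasestr() until it fails, so `pos` ends up at the last case-insensitive occurrence of the keyword. A standalone sketch of that idiom (strcasestr() is a GNU/BSD extension; note that scanning raw SQL this way can also match a keyword inside a string literal, which the code above does not guard against):

#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>

static const char *
last_keyword(const char *sql, const char *kw)
{
	const char *pos = NULL;
	const char *next = strcasestr(sql, kw);

	while (next != NULL)
	{
		pos = next;
		next = strcasestr(pos + 1, kw);
	}
	return pos;			/* NULL if the keyword never occurs */
}

int
main(void)
{
	printf("%s\n", last_keyword("select * from t limit 5 offset 10", "OFFSET"));
	return 0;
}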
* * For the prototype, there will only be one step, @@ -1997,6 +2082,7 @@ pgxc_planner(Query *query, int cursorOptions, ParamListInfo boundParams) Plan *standardPlan = result->planTree; RemoteQuery *query_step = makeNode(RemoteQuery); + query_step->is_single_step = false; query_step->sql_statement = pstrdup(query->sql_statement); query_step->exec_nodes = NULL; query_step->combine_type = COMBINE_TYPE_NONE; @@ -2020,21 +2106,6 @@ pgxc_planner(Query *query, int cursorOptions, ParamListInfo boundParams) ereport(ERROR, (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), (errmsg("INTO clause not yet supported")))); - - if (query->setOperations) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("UNION, INTERSECT and EXCEPT are not yet supported")))); - - if (query->hasRecursive) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("WITH RECURSIVE not yet supported")))); - - if (query->hasWindowFuncs) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("Window functions not yet supported")))); /* fallthru */ case T_InsertStmt: case T_UpdateStmt: @@ -2043,14 +2114,32 @@ pgxc_planner(Query *query, int cursorOptions, ParamListInfo boundParams) if (query_step->exec_nodes == NULL) { + /* Do not yet allow multi-node correlated UPDATE or DELETE */ + if ((query->nodeTag == T_UpdateStmt || query->nodeTag == T_DeleteStmt)) + { + ereport(ERROR, + (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), + (errmsg("Complex and correlated UPDATE and DELETE not yet supported")))); + } + /* - * Processing guery against catalog tables, restore - standard plan + * Processing query against catalog tables, or multi-step command. + * Restore standard plan */ result->planTree = standardPlan; return result; } + /* Do not yet allow multi-node correlated UPDATE or DELETE */ + if ((query->nodeTag == T_UpdateStmt || query->nodeTag == T_DeleteStmt) + && !query_step->exec_nodes + && list_length(query->rtable) > 1) + { + result->planTree = standardPlan; + return result; + } + + query_step->is_single_step = true; /* * PGXCTODO * When Postgres runs insert into t (a) values (1); against table @@ -2064,7 +2153,7 @@ pgxc_planner(Query *query, int cursorOptions, ParamListInfo boundParams) * then call standard planner and take targetList from the plan * generated by Postgres. */ - query_step->plan.targetlist = standardPlan->targetlist; + query_step->scan.plan.targetlist = standardPlan->targetlist; if (query_step->exec_nodes) query_step->combine_type = get_plan_combine_type( @@ -2075,39 +2164,36 @@ pgxc_planner(Query *query, int cursorOptions, ParamListInfo boundParams) query_step->simple_aggregates = get_simple_aggregates(query); /* - * Add sortring to the step + * Add sorting to the step */ if (list_length(query_step->exec_nodes->nodelist) > 1 && (query->sortClause || query->distinctClause)) make_simple_sort_from_sortclauses(query, query_step); - /* - * PG-XC cannot yet support some variations of SQL statements. - * We perform some checks to at least catch common cases - */ + /* Handle LIMIT and OFFSET for single-step queries on multiple nodes */ + if (handle_limit_offset(query_step, query, result)) + { + /* complicated expressions, just fall back to standard plan */ + result->planTree = standardPlan; + return result; + } + /* + * Use standard plan if we have more than one data node with either + * group by, hasWindowFuncs, or hasRecursive + */ /* - * Check if we have multiple nodes and an unsupported clause. 
This - * is temporary until we expand supported SQL + * PGXCTODO - this could be improved to check if the first + * group by expression is the partitioning column, in which + * case it is ok to treat as a single step. */ - if (query->nodeTag == T_SelectStmt) + if (query->nodeTag == T_SelectStmt + && query_step->exec_nodes + && list_length(query_step->exec_nodes->nodelist) > 1 + && (query->groupClause || query->hasWindowFuncs || query->hasRecursive)) { - if (StrictStatementChecking && query_step->exec_nodes - && list_length(query_step->exec_nodes->nodelist) > 1) - { - /* - * PGXCTODO - this could be improved to check if the first - * group by expression is the partitioning column - */ - if (query->groupClause) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("Multi-node GROUP BY not yet supported")))); - if (query->limitCount && StrictSelectChecking) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("Multi-node LIMIT not yet supported")))); - } + result->planTree = standardPlan; + return result; } break; default: diff --git a/src/backend/pgxc/pool/Makefile b/src/backend/pgxc/pool/Makefile index e875303..c7e950a 100644 --- a/src/backend/pgxc/pool/Makefile +++ b/src/backend/pgxc/pool/Makefile @@ -14,6 +14,6 @@ subdir = src/backend/pgxc/pool top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global -OBJS = datanode.o execRemote.o poolmgr.o poolcomm.o +OBJS = datanode.o execRemote.o poolmgr.o poolcomm.o postgresql_fdw.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/pgxc/pool/execRemote.c b/src/backend/pgxc/pool/execRemote.c index 0f16c51..43569e0 100644 --- a/src/backend/pgxc/pool/execRemote.c +++ b/src/backend/pgxc/pool/execRemote.c @@ -30,6 +30,8 @@ #include "utils/tuplesort.h" #include "utils/snapmgr.h" +extern char *deparseSql(RemoteQueryState *scanstate); + /* * Buffer size does not affect performance significantly, just do not allow * connection buffer grows infinitely @@ -1461,8 +1463,8 @@ DataNodeCopyBegin(const char *query, List *nodelist, Snapshot snapshot, bool is_ { if (need_tran) DataNodeCopyFinish(connections, 0, COMBINE_TYPE_NONE); - else - if (!PersistentConnections) release_handles(); + else if (!PersistentConnections) + release_handles(); } pfree(connections); @@ -1812,21 +1814,44 @@ ExecCountSlotsRemoteQuery(RemoteQuery *node) RemoteQueryState * ExecInitRemoteQuery(RemoteQuery *node, EState *estate, int eflags) { - RemoteQueryState *remotestate; + RemoteQueryState *remotestate; + Relation currentRelation; + remotestate = CreateResponseCombiner(0, node->combine_type); remotestate->ss.ps.plan = (Plan *) node; remotestate->ss.ps.state = estate; remotestate->simple_aggregates = node->simple_aggregates; + remotestate->ss.ps.qual = (List *) + ExecInitExpr((Expr *) node->scan.plan.qual, + (PlanState *) remotestate); + ExecInitResultTupleSlot(estate, &remotestate->ss.ps); - if (node->plan.targetlist) + if (node->scan.plan.targetlist) { - TupleDesc typeInfo = ExecCleanTypeFromTL(node->plan.targetlist, false); + TupleDesc typeInfo = ExecCleanTypeFromTL(node->scan.plan.targetlist, false); ExecSetSlotDescriptor(remotestate->ss.ps.ps_ResultTupleSlot, typeInfo); } ExecInitScanTupleSlot(estate, &remotestate->ss); + + /* + * Initialize scan relation. get the relation object id from the + * relid'th entry in the range table, open that relation and acquire + * appropriate lock on it. + * This is needed for deparseSQL + * We should remove these lines once we plan and deparse earlier. 
+ */ + if (!node->is_single_step) + { + currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid); + remotestate->ss.ss_currentRelation = currentRelation; + ExecAssignScanType(&remotestate->ss, RelationGetDescr(currentRelation)); + } + + remotestate->ss.ps.ps_TupFromTlist = false; + /* * Tuple description for the scan slot will be set on runtime from * a RowDescription message @@ -1991,7 +2016,6 @@ TupleTableSlot * ExecRemoteQuery(RemoteQueryState *node) { RemoteQuery *step = (RemoteQuery *) node->ss.ps.plan; - EState *estate = node->ss.ps.state; TupleTableSlot *resultslot = node->ss.ps.ps_ResultTupleSlot; TupleTableSlot *scanslot = node->ss.ss_ScanTupleSlot; bool have_tuple = false; @@ -2092,6 +2116,11 @@ ExecRemoteQuery(RemoteQueryState *node) data_node_begin(new_count, new_connections, gxid); } + /* Get the SQL string */ + /* only do if not single step */ + if (!step->is_single_step) + step->sql_statement = deparseSql(node); + /* See if we have a primary nodes, execute on it first before the others */ if (primaryconnection) { @@ -2427,12 +2456,35 @@ ExecEndRemoteQuery(RemoteQueryState *node) if (outerPlanState(node)) ExecEndNode(outerPlanState(node)); + if (node->ss.ss_currentRelation) + ExecCloseScanRelation(node->ss.ss_currentRelation); + if (node->tmp_ctx) MemoryContextDelete(node->tmp_ctx); CloseCombiner(node); } + +/* ---------------------------------------------------------------- + * ExecRemoteQueryReScan + * + * Rescans the relation. + * ---------------------------------------------------------------- + */ +void +ExecRemoteQueryReScan(RemoteQueryState *node, ExprContext *exprCtxt) +{ + /* At the moment we materialize results for multi-step queries, + * so no need to support rescan. + // PGXCTODO - rerun Init? + //node->routine->ReOpen(node); + + //ExecScanReScan((ScanState *) node); + */ +} + + /* * Execute utility statement on multiple data nodes * It does approximately the same as diff --git a/src/backend/pgxc/pool/postgresql_fdw.c b/src/backend/pgxc/pool/postgresql_fdw.c new file mode 100644 index 0000000..9e418be --- /dev/null +++ b/src/backend/pgxc/pool/postgresql_fdw.c @@ -0,0 +1,335 @@ +/*------------------------------------------------------------------------- + * + * postgresql_fdw.c + * foreign-data wrapper for PostgreSQL + * + * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "catalog/pg_operator.h" +#include "catalog/pg_proc.h" +#include "funcapi.h" +//#include "libpq-fe.h" +#include "mb/pg_wchar.h" +#include "miscadmin.h" +#include "nodes/nodeFuncs.h" +#include "nodes/makefuncs.h" +#include "optimizer/clauses.h" +#include "parser/scansup.h" +#include "pgxc/execRemote.h" +#include "utils/builtins.h" +#include "utils/lsyscache.h" +#include "utils/memutils.h" +#include "utils/syscache.h" + +//#include "dblink.h" + +#define DEBUG_FDW + +/* + * WHERE clause optimization level + */ +#define EVAL_QUAL_LOCAL 0 /* evaluate none in foreign, all in local */ +#define EVAL_QUAL_BOTH 1 /* evaluate some in foreign, all in local */ +#define EVAL_QUAL_FOREIGN 2 /* evaluate some in foreign, rest in local */ +#define OPTIMIZE_WHERE_CLAUSE EVAL_QUAL_FOREIGN + + + +/* deparse SQL from the request */ +static bool is_immutable_func(Oid funcid); +static bool is_foreign_qual(ExprState *state); +static bool foreign_qual_walker(Node *node, void *context); +char *deparseSql(RemoteQueryState 
*scanstate); + + +/* + * Check whether the function is IMMUTABLE. + */ +static bool +is_immutable_func(Oid funcid) +{ + HeapTuple tp; + bool isnull; + Datum datum; + + tp = SearchSysCache(PROCOID, ObjectIdGetDatum(funcid), 0, 0, 0); + if (!HeapTupleIsValid(tp)) + elog(ERROR, "cache lookup failed for function %u", funcid); + + datum = SysCacheGetAttr(PROCOID, tp, Anum_pg_proc_provolatile, &isnull); + +#ifdef DEBUG_FDW + /* print function name and its immutability */ + { + char *proname; + Datum namedatum; + namedatum = SysCacheGetAttr(PROCOID, tp, Anum_pg_proc_proname, &isnull); + proname = pstrdup(DatumGetName(namedatum)->data); + elog(DEBUG1, "func %s(%u) is%s immutable", proname, funcid, + (DatumGetChar(datum) == PROVOLATILE_IMMUTABLE) ? "" : " not"); + pfree(proname); + } +#endif + + ReleaseSysCache(tp); + + return (DatumGetChar(datum) == PROVOLATILE_IMMUTABLE); +} + +/* + * Check whether the ExprState node should be evaluated in the foreign server. + * + * An expression which consists of expressions below will be evaluated in + * the foreign server. + * - constant value + * - variable (foreign table column) + * - external parameter (parameter of prepared statement) + * - array + * - bool expression (AND/OR/NOT) + * - NULL test (IS [NOT] NULL) + * - operator + * - IMMUTABLE only + * - It is required that the meaning of the operator be the same as the + * local server in the foreign server. + * - function + * - IMMUTABLE only + * - It is required that the meaning of the function be the same as the + * local server in the foreign server. + * - scalar array operator (ANY/ALL) + */ +static bool +is_foreign_qual(ExprState *state) +{ + return !foreign_qual_walker((Node *) state->expr, NULL); +} + +/* + * return true if node cannot be evaluated in the foreign server. + */ +static bool +foreign_qual_walker(Node *node, void *context) +{ + if (node == NULL) + return false; + + switch (nodeTag(node)) + { + case T_Param: + /* TODO: pass internal parameters to the foreign server */ + if (((Param *) node)->paramkind != PARAM_EXTERN) + return true; + break; + case T_DistinctExpr: + case T_OpExpr: + /* + * An operator which uses an IMMUTABLE function can be evaluated in + * the foreign server. It is not necessary to worry about oprrest + * and oprjoin here because they are invoked by the planner but not + * the executor. DistinctExpr is a typedef of OpExpr. + */ + if (!is_immutable_func(((OpExpr*) node)->opfuncid)) + return true; + break; + case T_ScalarArrayOpExpr: + if (!is_immutable_func(((ScalarArrayOpExpr*) node)->opfuncid)) + return true; + break; + case T_FuncExpr: + /* an IMMUTABLE function can be evaluated in the foreign server */ + if (!is_immutable_func(((FuncExpr*) node)->funcid)) + return true; + break; + case T_TargetEntry: + case T_PlaceHolderVar: + case T_AppendRelInfo: + case T_PlaceHolderInfo: + /* TODO: research whether those complex nodes can be evaluated. */ + return true; + default: + break; + } + + return expression_tree_walker(node, foreign_qual_walker, context); +} + +/* + * Deparse SQL string from query request. + * + * The expressions in Plan.qual are deparsed when they satisfy is_foreign_qual() + * and removed. 
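is_foreign_qual() above is the push-down test: a qual can be shipped to the remote node only if no operator or function in it is mutable. A self-contained sketch of the same shape, with toy node types standing in for PostgreSQL's (the real walker recurses via expression_tree_walker()):

#include <stdbool.h>
#include <stddef.h>

typedef enum { N_CONST, N_VAR, N_FUNC } NodeKind;

typedef struct Expr
{
	NodeKind	kind;
	bool		immutable;		/* meaningful for N_FUNC only */
	struct Expr *args[2];		/* children, NULL when absent */
} Expr;

/* returns true if any function in the tree is not IMMUTABLE */
static bool
contains_unsafe(const Expr *e)
{
	if (e == NULL)
		return false;
	if (e->kind == N_FUNC && !e->immutable)
		return true;			/* volatile/stable: evaluate locally */
	return contains_unsafe(e->args[0]) || contains_unsafe(e->args[1]);
}

static bool
is_foreign_qual_sketch(const Expr *e)
{
	return !contains_unsafe(e);
}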
+ */ +char * +deparseSql(RemoteQueryState *scanstate) +{ + EState *estate = scanstate->ss.ps.state; + bool prefix; + List *context; + StringInfoData sql; + RemoteQuery *scan; + RangeTblEntry *rte; + Oid nspid; + char *nspname; + char *relname; + const char *nspname_q; + const char *relname_q; + const char *aliasname_q; + int i; + TupleDesc tupdesc; + bool first; + +elog(DEBUG2, "%s(%u) called", __FUNCTION__, __LINE__); + + /* extract RemoteQuery and RangeTblEntry */ + scan = (RemoteQuery *)scanstate->ss.ps.plan; + rte = list_nth(estate->es_range_table, scan->scan.scanrelid - 1); + + /* prepare to deparse plan */ + initStringInfo(&sql); + context = deparse_context_for_plan((Node *)scan, NULL, + estate->es_range_table, NULL); + + /* + * Scanning multiple relations in a RemoteQuery node is not supported. + */ + prefix = false; +#if 0 + prefix = list_length(estate->es_range_table) > 1; +#endif + + /* Get quoted names of schema, table and alias */ + nspid = get_rel_namespace(rte->relid); + nspname = get_namespace_name(nspid); + relname = get_rel_name(rte->relid); + nspname_q = quote_identifier(nspname); + relname_q = quote_identifier(relname); + aliasname_q = quote_identifier(rte->eref->aliasname); + + /* deparse SELECT clause */ + appendStringInfo(&sql, "SELECT "); + + /* + * TODO: omit (deparse to "NULL") columns which are not used in the + * original SQL. + * + * We must examine the parents of this RemoteQuery node to determine unused + * columns because some columns may be used only in parent Sort/Agg/Limit + * nodes. + */ + tupdesc = scanstate->ss.ss_currentRelation->rd_att; + first = true; + for (i = 0; i < tupdesc->natts; i++) + { + /* skip dropped attributes */ + if (tupdesc->attrs[i]->attisdropped) + continue; + + if (!first) + appendStringInfoString(&sql, ", "); + + if (prefix) + appendStringInfo(&sql, "%s.%s", + aliasname_q, tupdesc->attrs[i]->attname.data); + else + appendStringInfo(&sql, "%s", tupdesc->attrs[i]->attname.data); + first = false; + } + + /* if target list is composed only of system attributes, add dummy column */ + if (first) + appendStringInfo(&sql, "NULL"); + + /* deparse FROM clause */ + appendStringInfo(&sql, " FROM "); + /* + * XXX: should use GENERIC OPTIONS like 'foreign_relname' or something for + * the foreign table name instead of the local name ? + */ + appendStringInfo(&sql, "%s.%s %s", nspname_q, relname_q, aliasname_q); + if (nspname_q != nspname) + pfree((char *) nspname_q); + if (relname_q != relname) + pfree((char *) relname_q); + if (aliasname_q != rte->eref->aliasname) + pfree((char *) aliasname_q); + pfree(nspname); + pfree(relname); + + /* + * deparse WHERE clause + * + * The expressions which satisfy is_foreign_qual() are deparsed into the WHERE + * clause of the result SQL string, and they could be removed from the qual of + * the PlanState to avoid duplicate evaluation at ExecScan(). + * + * The Plan.qual is never changed, so multiple use of the Plan with + * PREPARE/EXECUTE work properly. + */ +#if OPTIMIZE_WHERE_CLAUSE > EVAL_QUAL_LOCAL + if (scanstate->ss.ps.plan->qual) + { + List *local_qual = NIL; + List *foreign_qual = NIL; + List *foreign_expr = NIL; + ListCell *lc; + + /* + * Divide the qual of the PlanState into two lists, one for local evaluation + * and one for foreign evaluation. 
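The loop that follows splits the qual list and then joins the pushable conditions with AND before appending them to the remote SQL. A standalone sketch of that string assembly (illustrative only, with hypothetical helper and variable names):

#include <stdio.h>
#include <string.h>

static void
append_where(char *sql, size_t len, const char **conds, int nconds)
{
	for (int i = 0; i < nconds; i++)
	{
		strncat(sql, i == 0 ? " WHERE " : " AND ", len - strlen(sql) - 1);
		strncat(sql, conds[i], len - strlen(sql) - 1);
	}
}

int
main(void)
{
	char		sql[256] = "SELECT a, b FROM public.t t";
	const char *conds[] = {"a > 5", "b IS NOT NULL"};

	append_where(sql, sizeof(sql), conds, 2);
	puts(sql);	/* ... WHERE a > 5 AND b IS NOT NULL */
	return 0;
}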
+ */ + foreach (lc, scanstate->ss.ps.qual) + { + ExprState *state = lfirst(lc); + + if (is_foreign_qual(state)) + { + elog(DEBUG1, "foreign qual: %s", nodeToString(state->expr)); + foreign_qual = lappend(foreign_qual, state); + foreign_expr = lappend(foreign_expr, state->expr); + } + else + { + elog(DEBUG1, "local qual: %s", nodeToString(state->expr)); + local_qual = lappend(local_qual, state); + } + } +#if OPTIMIZE_WHERE_CLAUSE == EVAL_QUAL_FOREIGN + /* + * If the optimization level is EVAL_QUAL_FOREIGN, replace the original + * qual with the list of ExprStates which should be evaluated in the + * local server. + */ + scanstate->ss.ps.qual = local_qual; +#endif + + /* + * Deparse quals to be evaluated in the foreign server if any. + * TODO: modify deparse_expression() to deparse conditions which use + * internal parameters. + */ + if (foreign_expr != NIL) + { + Node *node; + node = (Node *) make_ands_explicit(foreign_expr); + appendStringInfo(&sql, " WHERE "); + appendStringInfoString(&sql, + deparse_expression(node, context, prefix, false)); + /* + * The contents of the list MUST NOT be freed because they are + * referenced from the Plan.qual list. + */ + list_free(foreign_expr); + } + } +#endif + + elog(DEBUG1, "deparsed SQL is \"%s\"", sql.data); + + return sql.data; +} + diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 608755f..a6f4767 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -198,8 +198,6 @@ static void log_disconnections(int code, Datum arg); #ifdef PGXC /* PGXC_DATANODE */ -static void pgxc_transaction_stmt (Node *parsetree); - /* ---------------------------------------------------------------- * PG-XC routines * ---------------------------------------------------------------- diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index d1c01da..0c6208c 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -306,7 +306,6 @@ ProcessUtility(Node *parsetree, case TRANS_STMT_START: { ListCell *lc; - #ifdef PGXC if (IS_PGXC_COORDINATOR) DataNodeBegin(); @@ -329,10 +328,6 @@ ProcessUtility(Node *parsetree, break; case TRANS_STMT_COMMIT: -#ifdef PGXC - if (IS_PGXC_COORDINATOR) - DataNodeCommit(); -#endif if (!EndTransactionBlock()) { /* report unsuccessful commit in completionTag */ @@ -361,10 +356,6 @@ ProcessUtility(Node *parsetree, break; case TRANS_STMT_ROLLBACK: -#ifdef PGXC - if (IS_PGXC_COORDINATOR) - DataNodeBegin(); -#endif UserAbortTransactionBlock(); break; @@ -1055,21 +1046,16 @@ ProcessUtility(Node *parsetree, case T_ExplainStmt: ExplainQuery((ExplainStmt *) parsetree, queryString, params, dest); -#ifdef PGXC - if (IS_PGXC_COORDINATOR) - { - Exec_Nodes *nodes = (Exec_Nodes *) palloc0(sizeof(Exec_Nodes)); - nodes->nodelist = GetAnyDataNode(); - ExecUtilityStmtOnNodes(queryString, nodes, false); - } -#endif break; case T_VariableSetStmt: ExecSetVariableStmt((VariableSetStmt *) parsetree); #ifdef PGXC +/* PGXCTODO - this currently causes an assertion failure. 
+ We should change when we add SET handling properly if (IS_PGXC_COORDINATOR) ExecUtilityStmtOnNodes(queryString, NULL, false); +*/ #endif break; diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h index 11be226..29aed38 100644 --- a/src/include/optimizer/cost.h +++ b/src/include/optimizer/cost.h @@ -79,6 +79,9 @@ extern void cost_functionscan(Path *path, PlannerInfo *root, RelOptInfo *baserel); extern void cost_valuesscan(Path *path, PlannerInfo *root, RelOptInfo *baserel); +#ifdef PGXC +extern void cost_remotequery(Path *path, PlannerInfo *root, RelOptInfo *baserel); +#endif extern void cost_ctescan(Path *path, PlannerInfo *root, RelOptInfo *baserel); extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm); extern void cost_sort(Path *path, PlannerInfo *root, diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h index 0f4c52e..05efcaf 100644 --- a/src/include/optimizer/pathnode.h +++ b/src/include/optimizer/pathnode.h @@ -56,6 +56,9 @@ extern Path *create_functionscan_path(PlannerInfo *root, RelOptInfo *rel); extern Path *create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel); extern Path *create_ctescan_path(PlannerInfo *root, RelOptInfo *rel); extern Path *create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel); +#ifdef PGXC +extern Path *create_remotequery_path(PlannerInfo *root, RelOptInfo *rel); +#endif extern NestPath *create_nestloop_path(PlannerInfo *root, RelOptInfo *joinrel, diff --git a/src/include/pgxc/execRemote.h b/src/include/pgxc/execRemote.h index b7faa7d..143c8fa 100644 --- a/src/include/pgxc/execRemote.h +++ b/src/include/pgxc/execRemote.h @@ -95,6 +95,7 @@ extern void ExecRemoteUtility(RemoteQuery *node); extern int handle_response(DataNodeHandle * conn, RemoteQueryState *combiner); extern bool FetchTuple(RemoteQueryState *combiner, TupleTableSlot *slot); -extern int primary_data_node; +extern void ExecRemoteQueryReScan(RemoteQueryState *node, ExprContext *exprCtxt); -#endif \ No newline at end of file +extern int primary_data_node; +#endif diff --git a/src/include/pgxc/planner.h b/src/include/pgxc/planner.h index bf8f224..346dd65 100644 --- a/src/include/pgxc/planner.h +++ b/src/include/pgxc/planner.h @@ -58,7 +58,8 @@ typedef struct */ typedef struct { - Plan plan; + Scan scan; + bool is_single_step; /* special case, skip extra work */ char *sql_statement; Exec_Nodes *exec_nodes; CombineType combine_type; ----------------------------------------------------------------------- Summary of changes: src/backend/commands/explain.c | 31 +++ src/backend/commands/tablecmds.c | 14 ++ src/backend/executor/execAmi.c | 8 + src/backend/executor/nodeMaterial.c | 38 ++++ src/backend/optimizer/path/allpaths.c | 20 ++ src/backend/optimizer/plan/createplan.c | 99 +++++++++ src/backend/optimizer/plan/setrefs.c | 16 ++ src/backend/optimizer/plan/subselect.c | 6 + src/backend/optimizer/util/pathnode.c | 22 ++ src/backend/pgxc/plan/planner.c | 196 +++++++++++++----- src/backend/pgxc/pool/Makefile | 2 +- src/backend/pgxc/pool/execRemote.c | 64 ++++++- src/backend/pgxc/pool/postgresql_fdw.c | 335 +++++++++++++++++++++++++++++++ src/backend/tcop/postgres.c | 2 - src/backend/tcop/utility.c | 20 +-- src/include/optimizer/cost.h | 3 + src/include/optimizer/pathnode.h | 3 + src/include/pgxc/execRemote.h | 5 +- src/include/pgxc/planner.h | 3 +- 19 files changed, 803 insertions(+), 84 deletions(-) create mode 100644 src/backend/pgxc/pool/postgresql_fdw.c hooks/post-receive -- Postgres-XC |
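The planner.h hunk at the end of this patch changes RemoteQuery to embed Scan rather than Plan, which is why every plan.targetlist reference above becomes scan.plan.targetlist. Because the embedded struct is the first member, a RemoteQuery pointer can still be cast to Plan * by generic plan-tree code, and now also to Scan * by scan-specific paths. A compact sketch of the layout trick (simplified stand-ins, not the real PostgreSQL definitions):

typedef struct Plan
{
	int			node_type;
	void	   *targetlist;
} Plan;

typedef struct Scan
{
	Plan		plan;			/* first member: (Plan *) cast stays valid */
	unsigned int scanrelid;		/* range-table index of the scanned relation */
} Scan;

typedef struct RemoteQuery
{
	Scan		scan;			/* first member: (Scan *) cast is valid too */
	int			is_single_step;
	const char *sql_statement;
} RemoteQuery;

static void *
plan_targetlist(RemoteQuery *rq)
{
	return ((Plan *) rq)->targetlist;	/* same object as rq->scan.plan.targetlist */
}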
From: mason_s <ma...@us...> - 2010-08-23 03:55:28
|
Project "Postgres-XC". The branch, master has been updated via cfb29183b57e811e0dfcf3641c5cc58458b2584a (commit) from fbaab7cc05f975cd6339918390fd22360744b08c (commit) - Log ----------------------------------------------------------------- commit cfb29183b57e811e0dfcf3641c5cc58458b2584a Author: M S <masonsharp@mason-sharps-macbook.local> Date: Mon Aug 23 12:53:55 2010 +0900 Portal integration changes. This integrates Postgres-XC code deeper into PostgreSQL. The Extended Query Protocol can now be used, which means that JDBC will now work. It also lays more groundwork for supporting multi-step queries (cross-node joins). Note that statements with parameters cannot yet be prepared and executed, only those without parameters will work. Note also that this patch introduces additional performance degradation because more processing occurs with each request. We will be working to address these issues in the coming weeks. Written by Andrei Martsinchyk diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 08e35ae..657413a 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -179,7 +179,6 @@ typedef struct CopyStateData /* Locator information */ RelationLocInfo *rel_loc; /* the locator key */ int hash_idx; /* index of the hash column */ - bool on_coord; DataNodeHandle **connections; /* Involved data node connections */ #endif @@ -800,31 +799,6 @@ CopyQuoteIdentifier(StringInfo query_buf, char *value) } #endif -#ifdef PGXC -/* - * In case there is no locator info available, copy to/from is launched in portal on coordinator. - * This happens for pg_catalog tables (not user defined ones) - * such as pg_catalog, pg_attribute, etc. - * This part is launched before the portal is activated, so check a first time if there - * some locator data for this relid and if no, return and launch the portal. - */ -bool -IsCoordPortalCopy(const CopyStmt *stmt) -{ - RelationLocInfo *rel_loc; /* the locator key */ - - /* In the case of a COPY SELECT, this is launched on datanodes */ - if(!stmt->relation) - return false; - - rel_loc = GetRelationLocInfo(RangeVarGetRelid(stmt->relation, true)); - - if (!rel_loc) - return true; - - return false; -} -#endif /* * DoCopy executes the SQL COPY statement @@ -857,11 +831,7 @@ IsCoordPortalCopy(const CopyStmt *stmt) * the table or the specifically requested columns. */ uint64 -#ifdef PGXC -DoCopy(const CopyStmt *stmt, const char *queryString, bool exec_on_coord_portal) -#else DoCopy(const CopyStmt *stmt, const char *queryString) -#endif { CopyState cstate; bool is_from = stmt->is_from; @@ -883,16 +853,6 @@ DoCopy(const CopyStmt *stmt, const char *queryString) /* Allocate workspace and zero all fields */ cstate = (CopyStateData *) palloc0(sizeof(CopyStateData)); -#ifdef PGXC - /* - * Copy to/from is initialized as being launched on datanodes - * This functionnality is particularly interesting to have a result for - * tables who have no locator informations such as pg_catalog, pg_class, - * and pg_attribute. - */ - cstate->on_coord = false; -#endif - /* Extract options from the statement node tree */ foreach(option, stmt->options) { @@ -1180,13 +1140,15 @@ DoCopy(const CopyStmt *stmt, const char *queryString) exec_nodes = (Exec_Nodes *) palloc0(sizeof(Exec_Nodes)); + /* + * If target table does not exists on nodes (e.g. system table) + * the location info returned is NULL. 
This is the criteria, when + * we need to run Copy on coordinator + */ cstate->rel_loc = GetRelationLocInfo(RelationGetRelid(cstate->rel)); - if (exec_on_coord_portal) - cstate->on_coord = true; - hash_att = GetRelationHashColumn(cstate->rel_loc); - if (!cstate->on_coord) + if (cstate->rel_loc) { if (is_from || hash_att) exec_nodes->nodelist = list_copy(cstate->rel_loc->nodeList); @@ -1481,7 +1443,7 @@ DoCopy(const CopyStmt *stmt, const char *queryString) * In the case of CopyOut, it is just necessary to pick up one node randomly. * This is done when rel_loc is found. */ - if (!cstate->on_coord) + if (cstate->rel_loc) { cstate->connections = DataNodeCopyBegin(cstate->query_buf.data, exec_nodes->nodelist, @@ -1506,7 +1468,7 @@ DoCopy(const CopyStmt *stmt, const char *queryString) } PG_CATCH(); { - if (IS_PGXC_COORDINATOR && is_from && !cstate->on_coord) + if (IS_PGXC_COORDINATOR && is_from && cstate->rel_loc) { DataNodeCopyFinish( cstate->connections, @@ -1519,18 +1481,13 @@ DoCopy(const CopyStmt *stmt, const char *queryString) PG_RE_THROW(); } PG_END_TRY(); - if (IS_PGXC_COORDINATOR && is_from && !cstate->on_coord) + if (IS_PGXC_COORDINATOR && is_from && cstate->rel_loc) { - if (cstate->rel_loc->locatorType == LOCATOR_TYPE_REPLICATED) - cstate->processed = DataNodeCopyFinish( - cstate->connections, - primary_data_node, - COMBINE_TYPE_SAME); - else - cstate->processed = DataNodeCopyFinish( - cstate->connections, - 0, - COMBINE_TYPE_SUM); + bool replicated = cstate->rel_loc->locatorType == LOCATOR_TYPE_REPLICATED; + DataNodeCopyFinish( + cstate->connections, + replicated ? primary_data_node : 0, + replicated ? COMBINE_TYPE_SAME : COMBINE_TYPE_SUM); pfree(cstate->connections); pfree(cstate->query_buf.data); FreeRelationLocInfo(cstate->rel_loc); @@ -1770,7 +1727,7 @@ CopyTo(CopyState cstate) } #ifdef PGXC - if (IS_PGXC_COORDINATOR && !cstate->on_coord) + if (IS_PGXC_COORDINATOR && cstate->rel_loc) { cstate->processed = DataNodeCopyOut( GetRelationNodes(cstate->rel_loc, NULL, true), @@ -2480,7 +2437,7 @@ CopyFrom(CopyState cstate) } #ifdef PGXC - if (IS_PGXC_COORDINATOR && !cstate->on_coord) + if (IS_PGXC_COORDINATOR && cstate->rel_loc) { Datum *hash_value = NULL; @@ -2494,6 +2451,7 @@ CopyFrom(CopyState cstate) ereport(ERROR, (errcode(ERRCODE_CONNECTION_EXCEPTION), errmsg("Copy failed on a data node"))); + cstate->processed++; } else { diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 131be22..847b556 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -858,6 +858,14 @@ InitPlan(QueryDesc *queryDesc, int eflags) { case CMD_SELECT: case CMD_INSERT: +#ifdef PGXC + /* + * PGXC RemoteQuery do not require ctid junk field, so follow + * standard procedure for UPDATE and DELETE + */ + case CMD_UPDATE: + case CMD_DELETE: +#endif foreach(tlist, plan->targetlist) { TargetEntry *tle = (TargetEntry *) lfirst(tlist); @@ -869,10 +877,12 @@ InitPlan(QueryDesc *queryDesc, int eflags) } } break; +#ifndef PGXC case CMD_UPDATE: case CMD_DELETE: junk_filter_needed = true; break; +#endif default: break; } diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c index 15af711..1affd6c 100644 --- a/src/backend/executor/execProcnode.c +++ b/src/backend/executor/execProcnode.c @@ -108,7 +108,9 @@ #include "executor/nodeWindowAgg.h" #include "executor/nodeWorktablescan.h" #include "miscadmin.h" - +#ifdef PGXC +#include "pgxc/execRemote.h" +#endif /* ------------------------------------------------------------------------ 
* ExecInitNode @@ -286,6 +288,13 @@ ExecInitNode(Plan *node, EState *estate, int eflags) estate, eflags); break; +#ifdef PGXC + case T_RemoteQuery: + result = (PlanState *) ExecInitRemoteQuery((RemoteQuery *) node, + estate, eflags); + break; +#endif + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); result = NULL; /* keep compiler quiet */ @@ -451,6 +460,12 @@ ExecProcNode(PlanState *node) result = ExecLimit((LimitState *) node); break; +#ifdef PGXC + case T_RemoteQueryState: + result = ExecRemoteQuery((RemoteQueryState *) node); + break; +#endif + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); result = NULL; @@ -627,6 +642,11 @@ ExecCountSlotsNode(Plan *node) case T_Limit: return ExecCountSlotsLimit((Limit *) node); +#ifdef PGXC + case T_RemoteQuery: + return ExecCountSlotsRemoteQuery((RemoteQuery *) node); +#endif + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); break; @@ -783,6 +803,12 @@ ExecEndNode(PlanState *node) ExecEndLimit((LimitState *) node); break; +#ifdef PGXC + case T_RemoteQueryState: + ExecEndRemoteQuery((RemoteQueryState *) node); + break; +#endif + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); break; diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 0a8d783..8dd924d 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -38,6 +38,10 @@ #include "parser/parse_expr.h" #include "parser/parse_oper.h" #include "parser/parsetree.h" +#ifdef PGXC +#include "pgxc/pgxc.h" +#include "pgxc/planner.h" +#endif #include "utils/lsyscache.h" #include "utils/syscache.h" @@ -119,7 +123,12 @@ planner(Query *parse, int cursorOptions, ParamListInfo boundParams) if (planner_hook) result = (*planner_hook) (parse, cursorOptions, boundParams); else - result = standard_planner(parse, cursorOptions, boundParams); +#ifdef PGXC + if (IS_PGXC_COORDINATOR) + result = pgxc_planner(parse, cursorOptions, boundParams); + else +#endif + result = standard_planner(parse, cursorOptions, boundParams); return result; } diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c index b5be190..5b2e03f 100644 --- a/src/backend/parser/analyze.c +++ b/src/backend/parser/analyze.c @@ -39,6 +39,11 @@ #include "parser/parse_target.h" #include "parser/parsetree.h" #include "rewrite/rewriteManip.h" +#ifdef PGXC +#include "pgxc/pgxc.h" +#include "pgxc/planner.h" +#include "tcop/tcopprot.h" +#endif #include "utils/rel.h" @@ -60,6 +65,10 @@ static Query *transformDeclareCursorStmt(ParseState *pstate, DeclareCursorStmt *stmt); static Query *transformExplainStmt(ParseState *pstate, ExplainStmt *stmt); +#ifdef PGXC +static Query *transformExecDirectStmt(ParseState *pstate, ExecDirectStmt *stmt); +#endif + static void transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc); static bool check_parameter_resolution_walker(Node *node, ParseState *pstate); @@ -206,6 +215,13 @@ transformStmt(ParseState *pstate, Node *parseTree) (ExplainStmt *) parseTree); break; +#ifdef PGXC + case T_ExecDirectStmt: + result = transformExecDirectStmt(pstate, + (ExecDirectStmt *) parseTree); + break; +#endif + default: /* @@ -270,6 +286,17 @@ analyze_requires_snapshot(Node *parseTree) result = true; break; +#ifdef PGXC + case T_ExecDirectStmt: + + /* + * We will parse/analyze/plan inner query, which probably will + * need a snapshot. Ensure it is set. 
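The planner() hunk above routes coordinator sessions into pgxc_planner() while data nodes keep the stock path. Condensed here for readability (this restates the patched function, it is not separate new behavior):

PlannedStmt *
planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
	if (planner_hook)
		return (*planner_hook) (parse, cursorOptions, boundParams);
#ifdef PGXC
	/* the coordinator plans distribution; data nodes plan locally */
	if (IS_PGXC_COORDINATOR)
		return pgxc_planner(parse, cursorOptions, boundParams);
#endif
	return standard_planner(parse, cursorOptions, boundParams);
}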
+ */ + result = true; + break; +#endif + default: /* utility statements don't have any active parse analysis */ result = false; @@ -2025,6 +2052,25 @@ transformExplainStmt(ParseState *pstate, ExplainStmt *stmt) return result; } +#ifdef PGXC +/* + * transformExecDirectStmt - + * transform an EXECUTE DIRECT statement + * + * Handling depends on whether we should execute on the nodes or on the coordinator. + * To execute on the nodes we return a CMD_UTILITY query having one T_RemoteQuery node + * with the inner statement as a sql_command. + * If the statement is to run on the coordinator, we should parse the inner statement and + * analyze the resulting query tree. + */ +static Query * +transformExecDirectStmt(ParseState *pstate, ExecDirectStmt *stmt) +{ + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("Support for EXECUTE DIRECT is temporarily broken"))); +} +#endif /* exported so planner can check again after rewriting, query pullup, etc */ void diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c index 2608a3f..f47cc6a 100644 --- a/src/backend/parser/parse_utilcmd.c +++ b/src/backend/parser/parse_utilcmd.c @@ -52,6 +52,7 @@ #ifdef PGXC #include "pgxc/locator.h" #include "pgxc/pgxc.h" +#include "pgxc/planner.h" #endif #include "rewrite/rewriteManip.h" @@ -261,9 +262,9 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString) result = list_concat(result, save_alist); #ifdef PGXC - /* - * If the user did not specify any distribution clause and there is no - * inherits clause, try and use PK or unique index + /* + * If the user did not specify any distribution clause and there is no + * inherits clause, try and use PK or unique index */ if (!stmt->distributeby && !stmt->inhRelations && cxt.fallback_dist_col) { @@ -271,6 +272,13 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString) stmt->distributeby->disttype = DISTTYPE_HASH; stmt->distributeby->colname = cxt.fallback_dist_col; } + if (IS_PGXC_COORDINATOR) + { + RemoteQuery *step = makeNode(RemoteQuery); + step->combine_type = COMBINE_TYPE_SAME; + step->sql_statement = queryString; + result = lappend(result, step); + } #endif return result; } @@ -1171,7 +1179,7 @@ transformIndexConstraint(Constraint *constraint, CreateStmtContext *cxt) { if (cxt->distributeby) isLocalSafe = CheckLocalIndexColumn ( - ConvertToLocatorType(cxt->distributeby->disttype), + ConvertToLocatorType(cxt->distributeby->disttype), cxt->distributeby->colname, key); } #endif @@ -1273,7 +1281,7 @@ transformIndexConstraint(Constraint *constraint, CreateStmtContext *cxt) { /* * Set fallback distribution column. - * If not set, set it to first column in index. + * If not set, set it to first column in index. * If primary key, we prefer that over a unique constraint. 
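The transformCreateStmt hunk above shows the forwarding idiom this commit uses for DDL: append a RemoteQuery step that carries the original query string, with COMBINE_TYPE_SAME so the coordinator expects matching results from every data node (the same idiom is repeated for transformAlterTableStmt below). In isolation:

#ifdef PGXC
	if (IS_PGXC_COORDINATOR)
	{
		RemoteQuery *step = makeNode(RemoteQuery);

		/* every node must report the same outcome, as with replicated tables */
		step->combine_type = COMBINE_TYPE_SAME;
		step->sql_statement = queryString;	/* replay the DDL verbatim */
		result = lappend(result, step);
	}
#endif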
*/ if (index->indexParams == NIL @@ -1281,7 +1289,7 @@ transformIndexConstraint(Constraint *constraint, CreateStmtContext *cxt) { cxt->fallback_dist_col = pstrdup(key); } - + /* Existing table, check if it is safe */ if (!cxt->distributeby && !isLocalSafe) isLocalSafe = CheckLocalIndexColumn ( @@ -1299,7 +1307,7 @@ transformIndexConstraint(Constraint *constraint, CreateStmtContext *cxt) index->indexParams = lappend(index->indexParams, iparam); } #ifdef PGXC - if (IS_PGXC_COORDINATOR && cxt->distributeby + if (IS_PGXC_COORDINATOR && cxt->distributeby && cxt->distributeby->disttype == DISTTYPE_HASH && !isLocalSafe) ereport(ERROR, (errcode(ERRCODE_INVALID_COLUMN_REFERENCE), @@ -1618,7 +1626,7 @@ transformRuleStmt(RuleStmt *stmt, const char *queryString, ereport(ERROR, (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), errmsg("Rule may not use NOTIFY, it is not yet supported"))); - + #endif /* * Since outer ParseState isn't parent of inner, have to pass down @@ -1956,7 +1964,15 @@ transformAlterTableStmt(AlterTableStmt *stmt, const char *queryString) result = lappend(cxt.blist, stmt); result = list_concat(result, cxt.alist); result = list_concat(result, save_alist); - +#ifdef PGXC + if (IS_PGXC_COORDINATOR) + { + RemoteQuery *step = makeNode(RemoteQuery); + step->combine_type = COMBINE_TYPE_SAME; + step->sql_statement = queryString; + result = lappend(result, step); + } +#endif return result; } diff --git a/src/backend/pgxc/plan/planner.c b/src/backend/pgxc/plan/planner.c index 002e710..1dcfc29 100644 --- a/src/backend/pgxc/plan/planner.c +++ b/src/backend/pgxc/plan/planner.c @@ -25,6 +25,7 @@ #include "nodes/nodes.h" #include "nodes/parsenodes.h" #include "optimizer/clauses.h" +#include "optimizer/planner.h" #include "optimizer/tlist.h" #include "parser/parse_agg.h" #include "parser/parse_coerce.h" @@ -116,7 +117,7 @@ typedef struct ColumnBase */ typedef struct XCWalkerContext { - Query *query; + Query *query; bool isRead; Exec_Nodes *exec_nodes; /* resulting execution nodes */ Special_Conditions *conditions; @@ -125,6 +126,7 @@ typedef struct XCWalkerContext int varno; bool within_or; bool within_not; + bool exec_on_coord; /* fallback to standard planner to have plan executed on coordinator only */ List *join_list; /* A list of List*'s, one for each relation. 
*/ } XCWalkerContext; @@ -971,6 +973,7 @@ get_plan_nodes_walker(Node *query_node, XCWalkerContext *context) /* just pg_catalog tables */ context->exec_nodes = (Exec_Nodes *) palloc0(sizeof(Exec_Nodes)); context->exec_nodes->tableusagetype = TABLE_USAGE_TYPE_PGCATALOG; + context->exec_on_coord = true; return false; } @@ -1087,6 +1090,7 @@ get_plan_nodes_walker(Node *query_node, XCWalkerContext *context) { context->exec_nodes = (Exec_Nodes *) palloc0(sizeof(Exec_Nodes)); context->exec_nodes->tableusagetype = TABLE_USAGE_TYPE_PGCATALOG; + context->exec_on_coord = true; return false; } @@ -1253,7 +1257,7 @@ get_plan_nodes_walker(Node *query_node, XCWalkerContext *context) static Exec_Nodes * get_plan_nodes(Query *query, bool isRead) { - Exec_Nodes *result_nodes; + Exec_Nodes *result_nodes = NULL; XCWalkerContext context; @@ -1267,13 +1271,16 @@ get_plan_nodes(Query *query, bool isRead) context.varno = 0; context.within_or = false; context.within_not = false; + context.exec_on_coord = false; context.join_list = NIL; - if (get_plan_nodes_walker((Node *) query, &context)) - result_nodes = NULL; - else + if (!get_plan_nodes_walker((Node *) query, &context)) result_nodes = context.exec_nodes; - + if (context.exec_on_coord && result_nodes) + { + pfree(result_nodes); + result_nodes = NULL; + } free_special_relations(context.conditions); free_join_list(context.join_list); return result_nodes; @@ -1976,68 +1983,89 @@ make_simple_sort_from_sortclauses(Query *query, RemoteQuery *step) * For the prototype, there will only be one step, * and the nodelist will be NULL if it is not a PGXC-safe statement. */ -Query_Plan * -GetQueryPlan(Node *parsetree, const char *sql_statement, List *querytree_list) +PlannedStmt * +pgxc_planner(Query *query, int cursorOptions, ParamListInfo boundParams) { - Query_Plan *query_plan = palloc(sizeof(Query_Plan)); + /* + * We waste some time invoking standard planner, but getting good enough + * PlannedStmt, we just need to replace standard plan. + * In future we may want to skip the standard_planner invocation and + * initialize the PlannedStmt here. At the moment not all queries works: + * ex. there was a problem with INSERT into a subset of table columns + */ + PlannedStmt *result = standard_planner(query, cursorOptions, boundParams); + Plan *standardPlan = result->planTree; RemoteQuery *query_step = makeNode(RemoteQuery); - Query *query; - query_step->sql_statement = (char *) palloc(strlen(sql_statement) + 1); - strcpy(query_step->sql_statement, sql_statement); + query_step->sql_statement = pstrdup(query->sql_statement); query_step->exec_nodes = NULL; query_step->combine_type = COMBINE_TYPE_NONE; query_step->simple_aggregates = NULL; - query_step->read_only = false; + /* Optimize multi-node handling */ + query_step->read_only = query->nodeTag == T_SelectStmt; query_step->force_autocommit = false; - query_plan->query_step_list = lappend(NULL, query_step); + result->planTree = (Plan *) query_step; /* * Determine where to execute the command, either at the Coordinator * level, Data Nodes, or both. By default we choose both. We should be * able to quickly expand this for more commands. 
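The control flow introduced here is easiest to see stripped of the diff markers: pgxc_planner() runs the stock planner first, substitutes a RemoteQuery step, and restores the saved standard plan whenever the statement must execute on the coordinator. A condensed restatement of the code above (sketch only, not a separate implementation):

PlannedStmt *
pgxc_planner_sketch(Query *query, int cursorOptions, ParamListInfo boundParams)
{
	PlannedStmt *result = standard_planner(query, cursorOptions, boundParams);
	Plan	   *standardPlan = result->planTree;
	RemoteQuery *query_step = makeNode(RemoteQuery);

	query_step->sql_statement = pstrdup(query->sql_statement);
	query_step->exec_nodes = get_plan_nodes_command(query);
	result->planTree = (Plan *) query_step;

	if (query_step->exec_nodes == NULL)
	{
		/* catalog-only or otherwise local: run the saved standard plan */
		result->planTree = standardPlan;
	}
	return result;
}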
*/ - switch (nodeTag(parsetree)) + switch (query->nodeTag) { case T_SelectStmt: - /* Optimize multi-node handling */ - query_step->read_only = true; + /* Perform some checks to make sure we can support the statement */ + if (query->intoClause) + ereport(ERROR, + (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), + (errmsg("INTO clause not yet supported")))); + + if (query->setOperations) + ereport(ERROR, + (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), + (errmsg("UNION, INTERSECT and EXCEPT are not yet supported")))); + + if (query->hasRecursive) + ereport(ERROR, + (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), + (errmsg("WITH RECURSIVE not yet supported")))); + + if (query->hasWindowFuncs) + ereport(ERROR, + (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), + (errmsg("Window functions not yet supported")))); /* fallthru */ case T_InsertStmt: case T_UpdateStmt: case T_DeleteStmt: - /* just use first one in querytree_list */ - query = (Query *) linitial(querytree_list); - /* should copy instead ? */ - query_step->plan.targetlist = query->targetList; + query_step->exec_nodes = get_plan_nodes_command(query); - /* Perform some checks to make sure we can support the statement */ - if (nodeTag(parsetree) == T_SelectStmt) + if (query_step->exec_nodes == NULL) { - if (query->intoClause) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("INTO clause not yet supported")))); - - if (query->setOperations) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("UNION, INTERSECT and EXCEPT are not yet supported")))); - - if (query->hasRecursive) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("WITH RECURSIVE not yet supported")))); - - if (query->hasWindowFuncs) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("Window functions not yet supported")))); + /* + * Processing guery against catalog tables, restore + * standard plan + */ + result->planTree = standardPlan; + return result; } - query_step->exec_nodes = - get_plan_nodes_command(query); + /* + * PGXCTODO + * When Postgres runs insert into t (a) values (1); against table + * defined as create table t (a int, b int); the plan is looking + * like insert into t (a,b) values (1,null); + * Later executor is verifying plan, to make sure table has not + * been altered since plan has been created and comparing table + * definition with plan target list and output error if they do + * not match. + * I could not find better way to generate targetList for pgxc plan + * then call standard planner and take targetList from the plan + * generated by Postgres. + */ + query_step->plan.targetlist = standardPlan->targetlist; + if (query_step->exec_nodes) query_step->combine_type = get_plan_combine_type( query, query_step->exec_nodes->baselocatortype); @@ -2047,37 +2075,9 @@ GetQueryPlan(Node *parsetree, const char *sql_statement, List *querytree_list) query_step->simple_aggregates = get_simple_aggregates(query); /* - * See if it is a SELECT with no relations, like SELECT 1+1 or - * SELECT nextval('fred'), and just use coord. 
- */ - if (query_step->exec_nodes == NULL - && (query->jointree->fromlist == NULL - || query->jointree->fromlist->length == 0)) - /* Just execute it on Coordinator */ - query_plan->exec_loc_type = EXEC_ON_COORD; - else - { - if (query_step->exec_nodes != NULL - && query_step->exec_nodes->tableusagetype == TABLE_USAGE_TYPE_PGCATALOG) - { - /* pg_catalog query, run on coordinator */ - query_plan->exec_loc_type = EXEC_ON_COORD; - } - else - { - query_plan->exec_loc_type = EXEC_ON_DATA_NODES; - - /* If node list is NULL, execute on coordinator */ - if (!query_step->exec_nodes) - query_plan->exec_loc_type = EXEC_ON_COORD; - } - } - - /* * Add sortring to the step */ - if (query_plan->exec_loc_type == EXEC_ON_DATA_NODES && - list_length(query_step->exec_nodes->nodelist) > 1 && + if (list_length(query_step->exec_nodes->nodelist) > 1 && (query->sortClause || query->distinctClause)) make_simple_sort_from_sortclauses(query, query_step); @@ -2090,7 +2090,7 @@ GetQueryPlan(Node *parsetree, const char *sql_statement, List *querytree_list) * Check if we have multiple nodes and an unsupported clause. This * is temporary until we expand supported SQL */ - if (nodeTag(parsetree) == T_SelectStmt) + if (query->nodeTag == T_SelectStmt) { if (StrictStatementChecking && query_step->exec_nodes && list_length(query_step->exec_nodes->nodelist) > 1) @@ -2110,180 +2110,6 @@ GetQueryPlan(Node *parsetree, const char *sql_statement, List *querytree_list) } } break; - - /* Statements that we only want to execute on the Coordinator */ - case T_VariableShowStmt: - query_plan->exec_loc_type = EXEC_ON_COORD; - break; - - /* - * Statements that need to run in autocommit mode, on Coordinator - * and Data Nodes with suppressed implicit two phase commit. - */ - case T_CheckPointStmt: - case T_ClusterStmt: - case T_CreatedbStmt: - case T_DropdbStmt: - case T_VacuumStmt: - query_plan->exec_loc_type = EXEC_ON_COORD | EXEC_ON_DATA_NODES; - query_step->force_autocommit = true; - break; - - case T_DropPropertyStmt: - /* - * Triggers are not yet supported by PGXC - * all other queries are executed on both Coordinator and Datanode - * On the same point, assert also is not supported - */ - if (((DropPropertyStmt *)parsetree)->removeType == OBJECT_TRIGGER) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("This command is not yet supported.")))); - else - query_plan->exec_loc_type = EXEC_ON_COORD | EXEC_ON_DATA_NODES; - break; - - case T_CreateStmt: - if (((CreateStmt *)parsetree)->relation->istemp) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("Temp tables are not yet supported.")))); - - query_plan->exec_loc_type = EXEC_ON_COORD | EXEC_ON_DATA_NODES; - break; - - /* - * Statements that we execute on both the Coordinator and Data Nodes - */ - case T_AlterDatabaseStmt: - case T_AlterDatabaseSetStmt: - case T_AlterDomainStmt: - case T_AlterFdwStmt: - case T_AlterForeignServerStmt: - case T_AlterFunctionStmt: - case T_AlterObjectSchemaStmt: - case T_AlterOpFamilyStmt: - case T_AlterSeqStmt: - case T_AlterTableStmt: /* Can also be used to rename a sequence */ - case T_AlterTSConfigurationStmt: - case T_AlterTSDictionaryStmt: - case T_ClosePortalStmt: /* In case CLOSE ALL is issued */ - case T_CommentStmt: - case T_CompositeTypeStmt: - case T_ConstraintsSetStmt: - case T_CreateCastStmt: - case T_CreateConversionStmt: - case T_CreateDomainStmt: - case T_CreateEnumStmt: - case T_CreateFdwStmt: - case T_CreateForeignServerStmt: - case T_CreateFunctionStmt: /* Only global functions are supported */ - 
case T_CreateOpClassStmt: - case T_CreateOpFamilyStmt: - case T_CreatePLangStmt: - case T_CreateSeqStmt: - case T_CreateSchemaStmt: - case T_DeallocateStmt: /* Allow for DEALLOCATE ALL */ - case T_DiscardStmt: - case T_DropCastStmt: - case T_DropFdwStmt: - case T_DropForeignServerStmt: - case T_DropPLangStmt: - case T_DropStmt: - case T_IndexStmt: - case T_LockStmt: - case T_ReindexStmt: - case T_RemoveFuncStmt: - case T_RemoveOpClassStmt: - case T_RemoveOpFamilyStmt: - case T_RenameStmt: - case T_RuleStmt: - case T_TruncateStmt: - case T_VariableSetStmt: - case T_ViewStmt: - - /* - * Also support these, should help later with pg_restore, although - * not very useful because of the pooler using the same user - */ - case T_GrantStmt: - case T_GrantRoleStmt: - case T_CreateRoleStmt: - case T_AlterRoleStmt: - case T_AlterRoleSetStmt: - case T_AlterUserMappingStmt: - case T_CreateUserMappingStmt: - case T_DropRoleStmt: - case T_AlterOwnerStmt: - case T_DropOwnedStmt: - case T_DropUserMappingStmt: - case T_ReassignOwnedStmt: - case T_DefineStmt: /* used for aggregates, some types */ - query_plan->exec_loc_type = EXEC_ON_COORD | EXEC_ON_DATA_NODES; - break; - - case T_TransactionStmt: - switch (((TransactionStmt *) parsetree)->kind) - { - case TRANS_STMT_SAVEPOINT: - case TRANS_STMT_RELEASE: - case TRANS_STMT_ROLLBACK_TO: - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("This type of transaction statement not yet supported")))); - break; - - default: - break; /* keep compiler quiet */ - } - query_plan->exec_loc_type = EXEC_ON_COORD | EXEC_ON_DATA_NODES; - break; - - /* - * For now, pick one of the data nodes until we modify real - * planner It will give an approximate idea of what an isolated - * data node will do - */ - case T_ExplainStmt: - if (((ExplainStmt *) parsetree)->analyze) - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("ANALYZE with EXPLAIN is currently not supported.")))); - - query_step->exec_nodes = palloc0(sizeof(Exec_Nodes)); - query_step->exec_nodes->nodelist = GetAnyDataNode(); - query_step->exec_nodes->baselocatortype = LOCATOR_TYPE_RROBIN; - query_plan->exec_loc_type = EXEC_ON_DATA_NODES; - break; - - /* - * Trigger queries are not yet supported by PGXC. - * Tablespace queries are also not yet supported. - * Two nodes on the same servers cannot use the same tablespace. - */ - case T_CreateTableSpaceStmt: - case T_CreateTrigStmt: - case T_DropTableSpaceStmt: - ereport(ERROR, - (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), - (errmsg("This command is not yet supported.")))); - break; - - /* - * Other statements we do not yet want to handle. - * By default they would be fobidden, but we list these for reference. - * Note that there is not a 1-1 correspndence between - * SQL command and the T_*Stmt structures. 
- */ - case T_DeclareCursorStmt: - case T_ExecuteStmt: - case T_FetchStmt: - case T_ListenStmt: - case T_LoadStmt: - case T_NotifyStmt: - case T_PrepareStmt: - case T_UnlistenStmt: - /* fall through */ default: /* Allow for override */ if (StrictStatementChecking) @@ -2291,12 +2117,10 @@ GetQueryPlan(Node *parsetree, const char *sql_statement, List *querytree_list) (errcode(ERRCODE_STATEMENT_TOO_COMPLEX), (errmsg("This command is not yet supported.")))); else - query_plan->exec_loc_type = EXEC_ON_COORD | EXEC_ON_DATA_NODES; - break; + result->planTree = standardPlan; } - - return query_plan; + return result; } @@ -2321,21 +2145,3 @@ free_query_step(RemoteQuery *query_step) list_free_deep(query_step->simple_aggregates); pfree(query_step); } - -/* - * Free Query_Plan struct - */ -void -FreeQueryPlan(Query_Plan *query_plan) -{ - ListCell *item; - - if (query_plan == NULL) - return; - - foreach(item, query_plan->query_step_list) - free_query_step((RemoteQuery *) lfirst(item)); - - pfree(query_plan->query_step_list); - pfree(query_plan); -} diff --git a/src/backend/pgxc/pool/execRemote.c b/src/backend/pgxc/pool/execRemote.c index bbedef0..0f16c51 100644 --- a/src/backend/pgxc/pool/execRemote.c +++ b/src/backend/pgxc/pool/execRemote.c @@ -168,9 +168,7 @@ CreateResponseCombiner(int node_count, CombineType combine_type) combiner->connections = NULL; combiner->conn_count = 0; combiner->combine_type = combine_type; - combiner->dest = NULL; combiner->command_complete_count = 0; - combiner->row_count = 0; combiner->request_type = REQUEST_TYPE_NOT_DEFINED; combiner->tuple_desc = NULL; combiner->description_count = 0; @@ -178,7 +176,6 @@ CreateResponseCombiner(int node_count, CombineType combine_type) combiner->copy_out_count = 0; combiner->errorMessage = NULL; combiner->query_Done = false; - combiner->completionTag = NULL; combiner->msg = NULL; combiner->msglen = 0; combiner->initAggregates = true; @@ -488,7 +485,8 @@ HandleCopyOutComplete(RemoteQueryState *combiner) static void HandleCommandComplete(RemoteQueryState *combiner, char *msg_body, size_t len) { - int digits = 0; + int digits = 0; + EState *estate = combiner->ss.ps.state; /* * If we did not receive description we are having rowcount or OK response @@ -496,7 +494,7 @@ HandleCommandComplete(RemoteQueryState *combiner, char *msg_body, size_t len) if (combiner->request_type == REQUEST_TYPE_NOT_DEFINED) combiner->request_type = REQUEST_TYPE_COMMAND; /* Extract rowcount */ - if (combiner->combine_type != COMBINE_TYPE_NONE) + if (combiner->combine_type != COMBINE_TYPE_NONE && estate) { uint64 rowcount; digits = parse_row_count(msg_body, len, &rowcount); @@ -507,7 +505,7 @@ HandleCommandComplete(RemoteQueryState *combiner, char *msg_body, size_t len) { if (combiner->command_complete_count) { - if (rowcount != combiner->row_count) + if (rowcount != estate->es_processed) /* There is a consistency issue in the database with the replicated table */ ereport(ERROR, (errcode(ERRCODE_DATA_CORRUPTED), @@ -515,37 +513,15 @@ HandleCommandComplete(RemoteQueryState *combiner, char *msg_body, size_t len) } else /* first result */ - combiner->row_count = rowcount; + estate->es_processed = rowcount; } else - combiner->row_count += rowcount; + estate->es_processed += rowcount; } else combiner->combine_type = COMBINE_TYPE_NONE; } - if (++combiner->command_complete_count == combiner->node_count) - { - if (combiner->completionTag) - { - if (combiner->combine_type == COMBINE_TYPE_NONE) - { - /* ensure we do not go beyond buffer bounds */ - if (len > 
COMPLETION_TAG_BUFSIZE) - len = COMPLETION_TAG_BUFSIZE; - memcpy(combiner->completionTag, msg_body, len); - } - else - { - /* Truncate msg_body to get base string */ - msg_body[len - digits - 1] = '\0'; - snprintf(combiner->completionTag, - COMPLETION_TAG_BUFSIZE, - "%s" UINT64_FORMAT, - msg_body, - combiner->row_count); - } - } - } + combiner->command_complete_count++; } /* @@ -653,6 +629,9 @@ HandleCopyDataRow(RemoteQueryState *combiner, char *msg_body, size_t len) (errcode(ERRCODE_DATA_CORRUPTED), errmsg("Unexpected response from the data nodes for 'd' message, current request type %d", combiner->request_type))); + /* count the row */ + combiner->processed++; + /* If there is a copy file, data has to be sent to the local file */ if (combiner->copy_file) /* write data to the copy file */ @@ -881,7 +860,6 @@ ValidateAndResetCombiner(RemoteQueryState *combiner) combiner->command_complete_count = 0; combiner->connections = NULL; combiner->conn_count = 0; - combiner->row_count = 0; combiner->request_type = REQUEST_TYPE_NOT_DEFINED; combiner->tuple_desc = NULL; combiner->description_count = 0; @@ -1106,7 +1084,6 @@ data_node_begin(int conn_count, DataNodeHandle ** connections, } combiner = CreateResponseCombiner(conn_count, COMBINE_TYPE_NONE); - combiner->dest = None_Receiver; /* Receive responses */ if (data_node_receive_responses(conn_count, connections, timeout, combiner)) @@ -1225,7 +1202,6 @@ data_node_commit(int conn_count, DataNodeHandle ** connections) } combiner = CreateResponseCombiner(conn_count, COMBINE_TYPE_NONE); - combiner->dest = None_Receiver; /* Receive responses */ if (data_node_receive_responses(conn_count, connections, timeout, combiner)) result = EOF; @@ -1268,10 +1244,7 @@ data_node_commit(int conn_count, DataNodeHandle ** connections) } if (!combiner) - { combiner = CreateResponseCombiner(conn_count, COMBINE_TYPE_NONE); - combiner->dest = None_Receiver; - } /* Receive responses */ if (data_node_receive_responses(conn_count, connections, timeout, combiner)) result = EOF; @@ -1336,7 +1309,6 @@ data_node_rollback(int conn_count, DataNodeHandle ** connections) } combiner = CreateResponseCombiner(conn_count, COMBINE_TYPE_NONE); - combiner->dest = None_Receiver; /* Receive responses */ if (data_node_receive_responses(conn_count, connections, timeout, combiner)) return EOF; @@ -1480,7 +1452,6 @@ DataNodeCopyBegin(const char *query, List *nodelist, Snapshot snapshot, bool is_ * client runs console or file copy */ combiner = CreateResponseCombiner(conn_count, COMBINE_TYPE_NONE); - combiner->dest = None_Receiver; /* Receive responses */ if (data_node_receive_responses(conn_count, connections, timeout, combiner) @@ -1541,7 +1512,6 @@ DataNodeCopyIn(char *data_row, int len, Exec_Nodes *exec_nodes, DataNodeHandle** if (primary_handle->inStart < primary_handle->inEnd) { RemoteQueryState *combiner = CreateResponseCombiner(1, COMBINE_TYPE_NONE); - combiner->dest = None_Receiver; handle_response(primary_handle, combiner); if (!ValidateAndCloseCombiner(combiner)) return EOF; @@ -1603,7 +1573,6 @@ DataNodeCopyIn(char *data_row, int len, Exec_Nodes *exec_nodes, DataNodeHandle** if (handle->inStart < handle->inEnd) { RemoteQueryState *combiner = CreateResponseCombiner(1, COMBINE_TYPE_NONE); - combiner->dest = None_Receiver; handle_response(handle, combiner); if (!ValidateAndCloseCombiner(combiner)) return EOF; @@ -1670,13 +1639,13 @@ DataNodeCopyOut(Exec_Nodes *exec_nodes, DataNodeHandle** copy_connections, FILE* bool need_tran; List *nodelist; ListCell *nodeitem; - uint64 processed = 0; + 
uint64 processed; nodelist = exec_nodes->nodelist; need_tran = !autocommit || conn_count > 1; combiner = CreateResponseCombiner(conn_count, COMBINE_TYPE_SUM); - combiner->dest = None_Receiver; + combiner->processed = 0; /* If there is an existing file where to copy data, pass it to combiner */ if (copy_file) combiner->copy_file = copy_file; @@ -1712,7 +1681,7 @@ DataNodeCopyOut(Exec_Nodes *exec_nodes, DataNodeHandle** copy_connections, FILE* } } - processed = combiner->row_count; + processed = combiner->processed; if (!ValidateAndCloseCombiner(combiner)) { @@ -1730,7 +1699,7 @@ DataNodeCopyOut(Exec_Nodes *exec_nodes, DataNodeHandle** copy_connections, FILE* /* * Finish copy process on all connections */ -uint64 +void DataNodeCopyFinish(DataNodeHandle** copy_connections, int primary_data_node, CombineType combine_type) { @@ -1743,7 +1712,6 @@ DataNodeCopyFinish(DataNodeHandle** copy_connections, int primary_data_node, DataNodeHandle *connections[NumDataNodes]; DataNodeHandle *primary_handle = NULL; int conn_count = 0; - uint64 processed; for (i = 0; i < NumDataNodes; i++) { @@ -1786,8 +1754,7 @@ DataNodeCopyFinish(DataNodeHandle** copy_connections, int primary_data_node, } combiner = CreateResponseCombiner(conn_count + 1, combine_type); - combiner->dest = None_Receiver; - error = data_node_receive_responses(1, &primary_handle, timeout, combiner) || error; + error = (data_node_receive_responses(1, &primary_handle, timeout, combiner) != 0) || error; } for (i = 0; i < conn_count; i++) @@ -1823,22 +1790,25 @@ DataNodeCopyFinish(DataNodeHandle** copy_connections, int primary_data_node, need_tran = !autocommit || primary_handle || conn_count > 1; if (!combiner) - { combiner = CreateResponseCombiner(conn_count, combine_type); - combiner->dest = None_Receiver; - } error = (data_node_receive_responses(conn_count, connections, timeout, combiner) != 0) || error; - processed = combiner->row_count; - if (!ValidateAndCloseCombiner(combiner) || error) ereport(ERROR, (errcode(ERRCODE_INTERNAL_ERROR), errmsg("Error while running COPY"))); +} - return processed; +#define REMOTE_QUERY_NSLOTS 2 +int +ExecCountSlotsRemoteQuery(RemoteQuery *node) +{ + return ExecCountSlotsNode(outerPlan((Plan *) node)) + + ExecCountSlotsNode(innerPlan((Plan *) node)) + + REMOTE_QUERY_NSLOTS; } + RemoteQueryState * ExecInitRemoteQuery(RemoteQuery *node, EState *estate, int eflags) { @@ -1876,6 +1846,9 @@ ExecInitRemoteQuery(RemoteQuery *node, EState *estate, int eflags) ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE); } + if (outerPlan(node)) + outerPlanState(remotestate) = ExecInitNode(outerPlan(node), estate, eflags); + return remotestate; } @@ -1927,6 +1900,83 @@ copy_slot(RemoteQueryState *node, TupleTableSlot *src, TupleTableSlot *dst) } } +static void +get_exec_connections(Exec_Nodes *exec_nodes, + int *regular_conn_count, + int *total_conn_count, + DataNodeHandle ***connections, + DataNodeHandle ***primaryconnection) +{ + List *nodelist = NIL; + List *primarynode = NIL; + + if (exec_nodes) + { + nodelist = exec_nodes->nodelist; + primarynode = exec_nodes->primarynodelist; + } + + if (list_length(nodelist) == 0) + { + if (primarynode) + *regular_conn_count = NumDataNodes - 1; + else + *regular_conn_count = NumDataNodes; + } + else + { + *regular_conn_count = list_length(nodelist); + } + + *total_conn_count = *regular_conn_count; + + /* Get connection for primary node, if used */ + if (primarynode) + { + *primaryconnection = get_handles(primarynode); + if (!*primaryconnection) + ereport(ERROR, + 
(errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Could not obtain connection from pool"))); + (*total_conn_count)++; + } + + /* Get other connections (non-primary) */ + *connections = get_handles(nodelist); + if (!*connections) + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Could not obtain connection from pool"))); + +} + +/* + * We want to run 2PC if the current transaction modified more than + * one node, so optimize a little and do not look further if we + * already have more than one write node. + */ +static void +register_write_nodes(int conn_count, DataNodeHandle **connections) +{ + int i, j; + + for (i = 0; i < conn_count && write_node_count < 2; i++) + { + bool found = false; + + for (j = 0; j < write_node_count && !found; j++) + { + if (write_node_list[j] == connections[i]) + found = true; + } + if (!found) + { + /* Add to transaction wide-list */ + write_node_list[write_node_count++] = connections[i]; + } + } +} + /* * Execute a step of the PGXC plan. * The step specifies a command to be executed on specified nodes. @@ -1950,66 +2000,51 @@ ExecRemoteQuery(RemoteQueryState *node) if (!node->query_Done) { /* First invocation, initialize */ - Exec_Nodes *exec_nodes = step->exec_nodes; bool force_autocommit = step->force_autocommit; bool is_read_only = step->read_only; GlobalTransactionId gxid = InvalidGlobalTransactionId; Snapshot snapshot = GetActiveSnapshot(); DataNodeHandle **connections = NULL; DataNodeHandle **primaryconnection = NULL; - List *nodelist = NIL; - List *primarynode = NIL; int i; - int j; int regular_conn_count; int total_conn_count; bool need_tran; - if (exec_nodes) - { - nodelist = exec_nodes->nodelist; - primarynode = exec_nodes->primarynodelist; - } - - if (list_length(nodelist) == 0) - { - if (primarynode) - regular_conn_count = NumDataNodes - 1; - else - regular_conn_count = NumDataNodes; - } - else + /* + * If a coordinator plan is specified, execute it first. + * If the plan returns tuples, we return them immediately. + * Once it stops returning tuples, or has returned them all during the + * current invocation, we go ahead and execute the remote query. The + * outer plan is then never executed again, because the node->query_Done + * flag will be set and execution won't reach this point. 
+ */ + if (outerPlanState(node)) { - regular_conn_count = list_length(nodelist); + TupleTableSlot *slot = ExecProcNode(outerPlanState(node)); + if (!TupIsNull(slot)) + return slot; } - total_conn_count = regular_conn_count; - node->node_count = total_conn_count; + get_exec_connections(step->exec_nodes, + &regular_conn_count, + &total_conn_count, + &connections, + &primaryconnection); - /* Get connection for primary node, if used */ - if (primarynode) - { - primaryconnection = get_handles(primarynode); - if (!primaryconnection) - ereport(ERROR, - (errcode(ERRCODE_INTERNAL_ERROR), - errmsg("Could not obtain connection from pool"))); - total_conn_count++; - } - - /* Get other connections (non-primary) */ - connections = get_handles(nodelist); - if (!connections) - ereport(ERROR, - (errcode(ERRCODE_INTERNAL_ERROR), - errmsg("Could not obtain connection from pool"))); + /* + * We save only the regular connections; by the time we exit this + * function we are finished with the primary connection and deal only + * with the regular connections on subsequent invocations + */ + node->node_count = regular_conn_count; if (force_autocommit) need_tran = false; else need_tran = !autocommit || total_conn_count > 1; - elog(DEBUG1, "autocommit = %s, has primary = %s, regular_conn_count = %d, statement_need_tran = %s", autocommit ? "true" : "false", primarynode ? "true" : "false", regular_conn_count, need_tran ? "true" : "false"); + elog(DEBUG1, "autocommit = %s, has primary = %s, regular_conn_count = %d, need_tran = %s", autocommit ? "true" : "false", primaryconnection ? "true" : "false", regular_conn_count, need_tran ? "true" : "false"); stat_statement(); if (autocommit) @@ -2019,44 +2054,11 @@ ExecRemoteQuery(RemoteQueryState *node) clear_write_node_list(); } - /* Check status of connections */ - /* - * We would want to run 2PC if current transaction modified more then - * one node. So optimize little bit and do not look further if we - * already have two. - */ - if (!is_read_only && write_node_count < 2) + if (!is_read_only) { - bool found; - if (primaryconnection) - { - found = false; - for (j = 0; j < write_node_count && !found; j++) - { - if (write_node_list[j] == primaryconnection[0]) - found = true; - } - if (!found) - { - /* Add to transaction wide-list */ - write_node_list[write_node_count++] = primaryconnection[0]; - } - } - for (i = 0; i < regular_conn_count && write_node_count < 2; i++) - { - found = false; - for (j = 0; j < write_node_count && !found; j++) - { - if (write_node_list[j] == connections[i]) - found = true; - } - if (!found) - { - /* Add to transaction wide-list */ - write_node_list[write_node_count++] = connections[i]; - } - } + register_write_nodes(1, primaryconnection); + register_write_nodes(regular_conn_count, connections); } gxid = GetCurrentGlobalTransactionId(); @@ -2209,12 +2211,10 @@ { ExecSetSlotDescriptor(scanslot, node->tuple_desc); /* - * we should send to client not the tuple_desc we just - * received, but tuple_desc from the planner. 
- * Data node may be sending junk columns for sorting + * Now tuple table slot is responsible for freeing the + * descriptor */ - (*node->dest->rStartup) (node->dest, CMD_SELECT, - resultslot->tts_tupleDescriptor); + node->tuple_desc = NULL; if (step->sort) { SimpleSort *sort = step->sort; @@ -2228,7 +2228,7 @@ * be initialized */ node->tuplesortstate = tuplesort_begin_merge( - node->tuple_desc, + scanslot->tts_tupleDescriptor, sort->numCols, sort->sortColIdx, sort->sortOperators, @@ -2290,7 +2290,6 @@ } } copy_slot(node, scanslot, resultslot); - (*node->dest->receiveSlot) (resultslot, node->dest); break; } if (!have_tuple) @@ -2310,12 +2309,26 @@ { if (node->simple_aggregates) { - /* - * Advance aggregate functions and allow to read up next - * data row message and get tuple in the same slot on - * next iteration - */ - exec_simple_aggregates(node, scanslot); + if (node->simple_aggregates) + { + /* + * Advance aggregate functions and allow to read up next + * data row message and get tuple in the same slot on + * next iteration + */ + exec_simple_aggregates(node, scanslot); + } + else + { + /* + * Receive current slot and read up next data row + * message before exiting the loop. Next time when this + * function is invoked we will have either data row + * message ready or EOF + */ + copy_slot(node, scanslot, resultslot); + have_tuple = true; + } } else { @@ -2326,7 +2339,6 @@ * message ready or EOF */ copy_slot(node, scanslot, resultslot); - (*node->dest->receiveSlot) (resultslot, node->dest); have_tuple = true; } } @@ -2380,10 +2392,7 @@ { finish_simple_aggregates(node, resultslot); if (!TupIsNull(resultslot)) - { - (*node->dest->receiveSlot) (resultslot, node->dest); have_tuple = true; - } } if (!have_tuple) /* report end of scan */ @@ -2405,12 +2414,234 @@ void ExecEndRemoteQuery(RemoteQueryState *node) { - (*node->dest->rShutdown) (node->dest); + /* + * Release tuplesort resources + */ + if (node->tuplesortstate != NULL) + tuplesort_end((Tuplesortstate *) node->tuplesortstate); + node->tuplesortstate = NULL; + + /* + * shut down the subplan + */ + if (outerPlanState(node)) + ExecEndNode(outerPlanState(node)); + if (node->tmp_ctx) MemoryContextDelete(node->tmp_ctx); + CloseCombiner(node); } +/* + * Execute a utility statement on multiple data nodes + * It does approximately the same as + * + * RemoteQueryState *state = ExecInitRemoteQuery(plan, estate, flags); + * Assert(TupIsNull(ExecRemoteQuery(state))); + * ExecEndRemoteQuery(state) + * + * But does not need an EState instance and avoids some unnecessary work, + * like allocating tuple slots. 
+ */ +void +ExecRemoteUtility(RemoteQuery *node) +{ + RemoteQueryState *remotestate; + bool force_autocommit = node->force_autocommit; + bool is_read_only = node->read_only; + GlobalTransactionId gxid = InvalidGlobalTransactionId; + Snapshot snapshot = GetActiveSnapshot(); + DataNodeHandle **connections = NULL; + DataNodeHandle **primaryconnection = NULL; + int regular_conn_count; + int total_conn_count; + bool need_tran; + int i; + + remotestate = CreateResponseCombiner(0, node->combine_type); + + get_exec_connections(node->exec_nodes, + &regular_conn_count, + &total_conn_count, + &connections, + &primaryconnection); + + if (force_autocommit) + need_tran = false; + else + need_tran = !autocommit || total_conn_count > 1; + + if (!is_read_only) + { + if (primaryconnection) + register_write_nodes(1, primaryconnection); + register_write_nodes(regular_conn_count, connections); + } + + gxid = GetCurrentGlobalTransactionId(); + if (!GlobalTransactionIdIsValid(gxid)) + { + if (primaryconnection) + pfree(primaryconnection); + pfree(connections); + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to get next transaction ID"))); + } + + if (need_tran) + { + /* + * Check if the data node connections are in a transaction and start + * transactions on the nodes where one has not been started + */ + DataNodeHandle *new_connections[total_conn_count]; + int new_count = 0; + + if (primaryconnection && primaryconnection[0]->transaction_status != 'T') + new_connections[new_count++] = primaryconnection[0]; + for (i = 0; i < regular_conn_count; i++) + if (connections[i]->transaction_status != 'T') + new_connections[new_count++] = connections[i]; + + if (new_count) + data_node_begin(new_count, new_connections, gxid); + } + + /* See if we have a primary node; execute on it first, before the others */ + if (primaryconnection) + { + /* If an explicit transaction is needed, the gxid has already been sent */ + if (!need_tran && data_node_send_gxid(primaryconnection[0], gxid)) + { + pfree(connections); + pfree(primaryconnection); + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to send command to data nodes"))); + } + if (snapshot && data_node_send_snapshot(primaryconnection[0], snapshot)) + { + pfree(connections); + pfree(primaryconnection); + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to send command to data nodes"))); + } + if (data_node_send_query(primaryconnection[0], node->sql_statement) != 0) + { + pfree(connections); + pfree(primaryconnection); + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to send command to data nodes"))); + } + + Assert(remotestate->combine_type == COMBINE_TYPE_SAME); + + while (remotestate->command_complete_count < 1) + { + PG_TRY(); + { + data_node_receive(1, primaryconnection, NULL); + while (handle_response(primaryconnection[0], remotestate) == RESPONSE_EOF) + data_node_receive(1, primaryconnection, NULL); + if (remotestate->errorMessage) + { + char *code = remotestate->errorCode; + ereport(ERROR, + (errcode(MAKE_SQLSTATE(code[0], code[1], code[2], code[3], code[4])), + errmsg("%s", remotestate->errorMessage))); + } + } + /* If we got an error response return immediately */ + PG_CATCH(); + { + pfree(primaryconnection); + pfree(connections); + PG_RE_THROW(); + } + PG_END_TRY(); + } + pfree(primaryconnection); + } + + for (i = 0; i < regular_conn_count; i++) + { + /* If an explicit transaction is needed, the gxid has already been sent */ + if (!need_tran && data_node_send_gxid(connections[i], gxid)) + { + pfree(connections); + ereport(ERROR, + 
(errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to send command to data nodes"))); + } + if (snapshot && data_node_send_snapshot(connections[i], snapshot)) + { + pfree(connections); + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to send command to data nodes"))); + } + if (data_node_send_query(connections[i], node->sql_statement) != 0) + { + pfree(connections); + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to send command to data nodes"))); + } + } + + /* + * Loop until all the nodes have completed their commands + */ + while (regular_conn_count > 0) + { + int i = 0; + + data_node_receive(regular_conn_count, connections, NULL); + /* + * Handle input from the data nodes. + * If we get RESPONSE_EOF, move to the next connection; more data + * will be received on the next iteration. + * If we get RESPONSE_COMPLETE, we exclude the connection from the + * list, as we expect no more input from it; the loop quits when all + * nodes have finished their work and sent ReadyForQuery, leaving the + * connections array empty. + * RESPONSE_TUPDESC and RESPONSE_DATAROW are unexpected for a utility + * statement and are reported as errors. + */ + while (i < regular_conn_count) + { + int res = handle_response(connections[i], remotestate); + if (res == RESPONSE_EOF) + { + i++; + } + else if (res == RESPONSE_COMPLETE) + { + if (i < --regular_conn_count) + connections[i] = connections[regular_conn_count]; + } + else if (res == RESPONSE_TUPDESC) + { + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Unexpected response from data node"))); + } + else if (res == RESPONSE_DATAROW) + { + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Unexpected response from data node"))); + } + } + } +} + /* * Called when the backend is ending. 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index ea66125..608755f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -650,6 +650,20 @@ pg_analyze_and_rewrite(Node *parsetree, const char *query_string, */ querytree_list = pg_rewrite_query(query); +#ifdef PGXC + if (IS_PGXC_COORDINATOR) + { + ListCell *lc; + + foreach(lc, querytree_list) + { + Query *query = (Query *) lfirst(lc); + query->sql_statement = pstrdup(query_string); + query->nodeTag = nodeTag(parsetree); + } + } +#endif + TRACE_POSTGRESQL_QUERY_REWRITE_DONE(query_string); return querytree_list; @@ -900,9 +914,6 @@ exec_simple_query(const char *query_string) DestReceiver *receiver; int16 format; #ifdef PGXC - Query_Plan *query_plan; - RemoteQuery *query_step; - bool exec_on_coord; /* * By default we do not want data nodes to contact GTM directly, @@ -910,9 +921,6 @@ exec_simple_query(const char *query_string) */ if (IS_PGXC_DATANODE) SetForceXidFromGTM(false); - - exec_on_coord = true; - query_plan = NULL; #endif /* @@ -968,131 +976,11 @@ exec_simple_query(const char *query_string) querytree_list = pg_analyze_and_rewrite(parsetree, query_string, NULL, 0); -#ifdef PGXC /* PGXC_COORD */ - if (IS_PGXC_COORDINATOR) - { - if (IsA(parsetree, TransactionStmt)) - pgxc_transaction_stmt(parsetree); - - else if (IsA(parsetree, ExecDirectStmt)) - { - ExecDirectStmt *execdirect = (ExecDirectStmt *) parsetree; - List *inner_parse_tree_list; - - Assert(IS_PGXC_COORDINATOR); - - exec_on_coord = execdirect->coordinator; - - /* - * Switch to appropriate context for constructing parse and - * query trees (these must outlive the execution context). - */ - oldcontext = MemoryContextSwitchTo(MessageContext); - - inner_parse_tree_list = pg_parse_query(execdirect->query); - /* - * we do not support complex commands (expanded to multiple - * parse trees) within EXEC DIRECT - */ - if (list_length(parsetree_list) != 1) - { - ereport(ERROR, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("Can not execute %s with EXECUTE DIRECT", - execdirect->query))); - } - parsetree = linitial(inner_parse_tree_list); - - /* - * Set up a snapshot if parse analysis/planning will need - * one. 
- */ - if (analyze_requires_snapshot(parsetree)) - { - PushActiveSnapshot(GetTransactionSnapshot()); - snapshot_set = true; - } - - querytree_list = pg_analyze_and_rewrite(parsetree, - query_string, - NULL, - 0); - - if (execdirect->nodes) - { - ListCell *lc; - Query *query = (Query *) linitial(querytree_list); - - query_plan = (Query_Plan *) palloc0(sizeof(Query_Plan)); - query_step = makeNode(RemoteQuery); - query_step->plan.targetlist = query->targetList; - query_step->sql_statement = pstrdup(execdirect->query); - query_step->exec_nodes = (Exec_Nodes *) palloc0(sizeof(Exec_Nodes)); - foreach (lc, execdirect->nodes) - { - int node = intVal(lfirst(lc)); - query_step->exec_nodes->nodelist = lappend_int(query_step->exec_nodes->nodelist, node); - } - query_step->combine_type = COMBINE_TYPE_SAME; - - query_plan->query_step_list = lappend(NULL, query_step); - query_plan->exec_loc_type = EXEC_ON_DATA_NODES; - } - - /* Restore context */ - MemoryContextSwitchTo(oldcontext); - - } - else if (IsA(parsetree, CopyStmt)) - { - CopyStmt *copy = (CopyStmt *) parsetree; - uint64 processed; - /* Snapshot is needed for the Copy */ - if (!snapshot_set) - { - PushActiveSnapshot(GetTransactionSnapshot()); - snapshot_set = true; - } - /* - * A check on locator is made in DoCopy to determine if the copy can be launched on - * Datanode or on Coordinator. - * If a table has no locator data, then IsCoordPortalCopy returns false and copy is launched - * on Coordinator instead (e.g., using pg_catalog tables). - * If a table has some locator data (user tables), then copy was launched normally - * in Datanodes - */ - if (!IsCoordPortalCopy(copy)) - { - exec_on_coord = false; - processed = DoCopy(copy, query_string, false); - snprintf(completionTag, COMPLETION_TAG_BUFSIZE, - "COPY " UINT64_FORMAT, processed); - } - else - exec_on_coord = true; - } - else - { - query_plan = GetQueryPlan(parsetree, query_string, querytree_list); - - exec_on_coord = query_plan-... [truncated message content] |
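A note on the receive loops introduced in the patch above: both ExecRemoteQuery and ExecRemoteUtility drop a finished connection by overwriting its slot with the last live connection and shrinking the count (connections[i] = connections[--regular_conn_count]), so the array always stays dense without any list bookkeeping. The following is a minimal, self-contained sketch of that swap-with-last pattern; it is not taken from the Postgres-XC sources, and the Conn struct and poll_connection() stub (standing in for handle_response()) are invented for illustration.

    #include <stdio.h>

    #define RESPONSE_EOF      0
    #define RESPONSE_COMPLETE 1

    /* Hypothetical stand-in for a data node connection: 'polls' is how
     * many polls remain before the node reports ReadyForQuery. */
    typedef struct { int id; int polls; } Conn;

    /* Stand-in for handle_response(): EOF means "need more data". */
    static int poll_connection(Conn *c)
    {
        return (--c->polls > 0) ? RESPONSE_EOF : RESPONSE_COMPLETE;
    }

    int main(void)
    {
        Conn connections[] = {{101, 1}, {102, 3}, {103, 2}, {104, 1}, {105, 2}};
        int  conn_count = 5;

        while (conn_count > 0)
        {
            int i = 0;

            /* One pass over the live connections, as in the executor loop */
            while (i < conn_count)
            {
                if (poll_connection(&connections[i]) == RESPONSE_EOF)
                    i++;        /* nothing yet; look at the next connection */
                else
                {
                    printf("node %d complete\n", connections[i].id);
                    /* Swap the last live connection into this slot and
                     * shrink the set; do not advance i, because the
                     * swapped-in connection has not been polled yet. */
                    if (i < --conn_count)
                        connections[i] = connections[conn_count];
                }
            }
            /* The real code would block in data_node_receive() here. */
        }
        return 0;
    }

The point of the pattern is that removal is O(1) and the array holds exactly the connections still owing a response, so the loop's only exit condition is conn_count reaching zero.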
From: mason_s <ma...@us...> - 2010-08-05 19:00:31
|
Project "Postgres-XC". The branch, master has been updated via fbaab7cc05f975cd6339918390fd22360744b08c (commit) from c7476b9cf075aba2dd2ed11ea57c632c1ad6721a (commit) - Log ----------------------------------------------------------------- commit fbaab7cc05f975cd6339918390fd22360744b08c Author: Mason S <mas...@ma...> Date: Thu Aug 5 14:55:55 2010 -0400 There is a race condition that could lead to problems for the CLOG and sub transactions. In Postgres-XC, multiple processes may decide to extend the CLOG at the same time. One will wait for the other, then afterwards re-zero out the page. Instead, once the lock is obtained, we re-check to make sure that another process did not already extend and create the page. If so, we just exit. diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 1028aae..2a0f245 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -590,6 +590,9 @@ ExtendCLOG(TransactionId newestXact) /* * The first condition makes sure we did not wrap around * The second checks if we are still using the same page + * Note that this value can change and we are not holding a lock, + * so we repeat the check below. We do it this way instead of + * grabbing the lock to avoid lock contention. */ if (ClogCtl->shared->latest_page_number - pageno <= CLOG_WRAP_CHECK_DELTA && pageno <= ClogCtl->shared->latest_page_number) @@ -604,6 +607,20 @@ ExtendCLOG(TransactionId newestXact) LWLockAcquire(CLogControlLock, LW_EXCLUSIVE); +#ifdef PGXC + /* + * We repeat the check. Another process may have written + * out the page already and advanced the latest_page_number + * while we were waiting for the lock. + */ + if (ClogCtl->shared->latest_page_number - pageno <= CLOG_WRAP_CHECK_DELTA + && pageno <= ClogCtl->shared->latest_page_number) + { + LWLockRelease(CLogControlLock); + return; + } +#endif + /* Zero the page and make an XLOG entry about it */ ZeroCLOGPage(pageno, true); diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c index e3c9a64..35c9d83 100644 --- a/src/backend/access/transam/subtrans.c +++ b/src/backend/access/transam/subtrans.c @@ -294,7 +294,6 @@ CheckPointSUBTRANS(void) TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(true); } - /* * Make sure that SUBTRANS has room for a newly-allocated XID. * @@ -325,7 +324,10 @@ ExtendSUBTRANS(TransactionId newestXact) /* * The first condition makes sure we did not wrap around - * The second checks if we are still using the same page + * The second checks if we are still using the same page. + * Note that this value can change and we are not holding a lock, + * so we repeat the check below. We do it this way instead of + * grabbing the lock to avoid lock contention. */ if (SubTransCtl->shared->latest_page_number - pageno <= SUBTRANS_WRAP_CHECK_DELTA && pageno <= SubTransCtl->shared->latest_page_number) @@ -340,6 +342,20 @@ ExtendSUBTRANS(TransactionId newestXact) LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE); +#ifdef PGXC + /* + * We repeat the check. Another process may have written + * out the page already and advanced the latest_page_number + * while we were waiting for the lock. 
+ */ + if (SubTransCtl->shared->latest_page_number - pageno <= SUBTRANS_WRAP_CHECK_DELTA + && pageno <= SubTransCtl->shared->latest_page_number) + { + LWLockRelease(SubtransControlLock); + return; + } +#endif + /* Zero the page */ ZeroSUBTRANSPage(pageno); ----------------------------------------------------------------------- Summary of changes: src/backend/access/transam/clog.c | 17 +++++++++++++++++ src/backend/access/transam/subtrans.c | 20 ++++++++++++++++++-- 2 files changed, 35 insertions(+), 2 deletions(-) hooks/post-receive -- Postgres-XC |
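The ExtendCLOG/ExtendSUBTRANS fix above is the familiar check, lock, and re-check idiom: the unlocked test keeps the common case cheap, but because latest_page_number can advance while a backend waits on the control lock, the same test must be repeated once the lock is held, or a second backend would re-zero a page that is already in use. Below is a minimal sketch of the shape of that logic, using a pthread mutex in place of the LWLock and a plain counter in place of the SLRU shared state; the names are illustrative, not the actual symbols, and the wraparound guard (CLOG_WRAP_CHECK_DELTA) from the real code is omitted.

    #include <pthread.h>

    static pthread_mutex_t control_lock = PTHREAD_MUTEX_INITIALIZER;
    static int latest_page_number = 0;  /* stand-in for shared->latest_page_number */

    static void zero_page(int pageno)
    {
        /* ... allocate, zero and WAL-log the new page ... */
    }

    void extend_to_page(int pageno)
    {
        /* Unlocked fast path: the page already exists, nothing to do.
         * The value may change under us, so this is only a hint and
         * must be re-checked once the lock is held. */
        if (pageno <= latest_page_number)
            return;

        pthread_mutex_lock(&control_lock);

        /* Re-check under the lock: another process may have created the
         * page and advanced latest_page_number while we were waiting. */
        if (pageno <= latest_page_number)
        {
            pthread_mutex_unlock(&control_lock);
            return;
        }

        zero_page(pageno);
        latest_page_number = pageno;

        pthread_mutex_unlock(&control_lock);
    }

Without the second check, two backends that both pass the unlocked test simply serialize on the lock and then both zero the page; the later zeroing wipes out any transaction status bits written in between, which is exactly the corruption this commit guards against.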
From: mason_s <ma...@us...> - 2010-08-05 18:49:16
|
Project "Postgres-XC". The branch, master has been updated via c7476b9cf075aba2dd2ed11ea57c632c1ad6721a (commit) from 086c5c6be32d4ca9232523cd64caf6d29aaac42c (commit) - Log ----------------------------------------------------------------- commit c7476b9cf075aba2dd2ed11ea57c632c1ad6721a Author: Mason S <mas...@ma...> Date: Thu Aug 5 14:36:37 2010 -0400 Added more handling to deal with data node connection failures. This includes forcing the release of connections in an unexpected state and bug fixes. This was written by Andrei Martsinchyk, with some additional handling added by Mason. diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index e390f8a..82eca8c 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -205,9 +205,9 @@ static char *tpc_b_bid = { "\\setrandom tid 1 :ntellers\n" "\\setrandom delta -5000 5000\n" "BEGIN;\n" - "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid AND bid = :bid;\n" - "SELECT abalance FROM pgbench_accounts WHERE aid = :aid AND bid = :bid\n" - "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid AND bid = :bid;\n" + "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid\n" + "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n" "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n" "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" "END;\n" diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 491d0d5..673aad1 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -135,7 +135,7 @@ typedef struct TransactionStateData { TransactionId transactionId; /* my XID, or Invalid if none */ #ifdef PGXC /* PGXC_COORD */ - GlobalTransactionId globalTransactionId; /* my GXID, or Invalid if none */ + GlobalTransactionId globalTransactionId; /* my GXID, or Invalid if none */ #endif SubTransactionId subTransactionId; /* my subxact ID */ char *name; /* savepoint name, if any */ @@ -314,7 +314,7 @@ GetCurrentGlobalTransactionId(void) * GetGlobalTransactionId * * This will return the GXID of the specified transaction, - * getting one from the GTM if it's not yet set. + * getting one from the GTM if it's not yet set. */ static GlobalTransactionId GetGlobalTransactionId(TransactionState s) @@ -469,7 +469,7 @@ AssignTransactionId(TransactionState s) if (IS_PGXC_COORDINATOR) { s->transactionId = (TransactionId) GetGlobalTransactionId(s); - elog(DEBUG1, "New transaction id assigned = %d, isSubXact = %s", + elog(DEBUG1, "New transaction id assigned = %d, isSubXact = %s", s->transactionId, isSubXact ? "true" : "false"); } else @@ -1679,6 +1679,14 @@ CommitTransaction(void) */ AtEOXact_UpdateFlatFiles(true); +#ifdef PGXC + /* + * There can be error on the data nodes. 
So go to the data nodes before + * changing the transaction state and doing local cleanup + */ + DataNodeCommit(); +#endif + /* Prevent cancel/die interrupt while cleaning up */ HOLD_INTERRUPTS(); @@ -1694,13 +1702,13 @@ CommitTransaction(void) latestXid = RecordTransactionCommit(); TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid); + #ifdef PGXC + /* + * Now we can let GTM know about transaction commit + */ if (IS_PGXC_COORDINATOR) { - /* Make sure this committed on the DataNodes, - * if so it will just return - */ - DataNodeCommit(DestNone); CommitTranGTM(s->globalTransactionId); latestXid = s->globalTransactionId; } @@ -1712,7 +1720,7 @@ CommitTransaction(void) CommitTranGTM((GlobalTransactionId) latestXid); } #endif - + /* * Let others know about no transaction in progress by me. Note that this * must be done _before_ releasing locks we hold and _after_ @@ -1808,7 +1816,7 @@ CommitTransaction(void) s->nChildXids = 0; s->maxChildXids = 0; -#ifdef PGXC +#ifdef PGXC if (IS_PGXC_COORDINATOR) s->globalTransactionId = InvalidGlobalTransactionId; else if (IS_PGXC_DATANODE) @@ -2143,10 +2151,10 @@ AbortTransaction(void) #ifdef PGXC if (IS_PGXC_COORDINATOR) { - /* Make sure this is rolled back on the DataNodes, - * if so it will just return + /* Make sure this is rolled back on the DataNodes, + * if so it will just return */ - DataNodeRollback(DestNone); + DataNodeRollback(); RollbackTranGTM(s->globalTransactionId); latestXid = s->globalTransactionId; } diff --git a/src/backend/pgxc/pool/datanode.c b/src/backend/pgxc/pool/datanode.c index 517b1e4..0f4072d 100644 --- a/src/backend/pgxc/pool/datanode.c +++ b/src/backend/pgxc/pool/datanode.c @@ -199,6 +199,9 @@ data_node_init(DataNodeHandle *handle, int sock, int nodenum) handle->sock = sock; handle->transaction_status = 'I'; handle->state = DN_CONNECTION_STATE_IDLE; +#ifdef DN_CONNECTION_DEBUG + handle->have_row_desc = false; +#endif handle->error = NULL; handle->outEnd = 0; handle->inStart = 0; @@ -211,7 +214,7 @@ data_node_init(DataNodeHandle *handle, int sock, int nodenum) * Wait while at least one of specified connections has data available and read * the data into the buffer */ -void +int data_node_receive(const int conn_count, DataNodeHandle ** connections, struct timeval * timeout) { @@ -239,7 +242,7 @@ data_node_receive( * Return if we do not have connections to receive input */ if (nfds == 0) - return; + return 0; retry: res_select = select(nfds + 1, &readfds, NULL, NULL, timeout); @@ -249,27 +252,19 @@ retry: if (errno == EINTR || errno == EAGAIN) goto retry; - /* - * PGXCTODO - we may want to close the connections and notify the - * pooler that these are invalid. 
- */ if (errno == EBADF) { - ereport(ERROR, - (errcode(ERRCODE_CONNECTION_FAILURE), - errmsg("select() bad file descriptor set"))); + elog(WARNING, "select() bad file descriptor set"); } - ereport(ERROR, - (errcode(ERRCODE_CONNECTION_FAILURE), - errmsg("select() error: %d", errno))); + elog(WARNING, "select() error: %d", errno); + return errno; } if (res_select == 0) { /* Handle timeout */ - ereport(ERROR, - (errcode(ERRCODE_CONNECTION_FAILURE), - errmsg("timeout while waiting for response"))); + elog(WARNING, "timeout while waiting for response"); + return EOF; } /* read data */ @@ -283,10 +278,9 @@ retry: if (read_status == EOF || read_status < 0) { - /* PGXCTODO - we should notify the pooler to destroy the connections */ - ereport(ERROR, - (errcode(ERRCODE_CONNECTION_FAILURE), - errmsg("unexpected EOF on datanode connection"))); + add_error_message(conn, "unexpected EOF on datanode connection"); + elog(WARNING, "unexpected EOF on datanode connection"); + return EOF; } else { @@ -294,6 +288,7 @@ retry: } } } + return 0; } @@ -522,7 +517,7 @@ get_message(DataNodeHandle *conn, int *len, char **msg) * ensure_in_buffer_capacity() will immediately return */ ensure_in_buffer_capacity(5 + (size_t) *len, conn); - conn->state == DN_CONNECTION_STATE_QUERY; + conn->state = DN_CONNECTION_STATE_QUERY; conn->inCursor = conn->inStart; return '\0'; } @@ -539,19 +534,27 @@ void release_handles(void) { int i; + int discard[NumDataNodes]; + int ndisc = 0; if (node_count == 0) return; - PoolManagerReleaseConnections(); for (i = 0; i < NumDataNodes; i++) { DataNodeHandle *handle = &handles[i]; if (handle->sock != NO_SOCKET) + { + if (handle->state != DN_CONNECTION_STATE_IDLE) + { + elog(WARNING, "Connection to data node %d has unexpected state %d and will be dropped", handle->nodenum, handle->state); + discard[ndisc++] = handle->nodenum; + } data_node_free(handle); + } } - + PoolManagerReleaseConnections(ndisc, discard); node_count = 0; } @@ -897,7 +900,7 @@ void add_error_message(DataNodeHandle *handle, const char *message) { handle->transaction_status = 'E'; - handle->state = DN_CONNECTION_STATE_IDLE; + handle->state = DN_CONNECTION_STATE_ERROR_NOT_READY; if (handle->error) { /* PGXCTODO append */ diff --git a/src/backend/pgxc/pool/execRemote.c b/src/backend/pgxc/pool/execRemote.c index c6f9042..bbedef0 100644 --- a/src/backend/pgxc/pool/execRemote.c +++ b/src/backend/pgxc/pool/execRemote.c @@ -24,6 +24,7 @@ #include "miscadmin.h" #include "pgxc/execRemote.h" #include "pgxc/poolmgr.h" +#include "storage/ipc.h" #include "utils/datum.h" #include "utils/memutils.h" #include "utils/tuplesort.h" @@ -40,9 +41,10 @@ static bool autocommit = true; static DataNodeHandle **write_node_list = NULL; static int write_node_count = 0; -static int data_node_begin(int conn_count, DataNodeHandle ** connections, CommandDest dest, GlobalTransactionId gxid); -static int data_node_commit(int conn_count, DataNodeHandle ** connections, CommandDest dest); -static int data_node_rollback(int conn_count, DataNodeHandle ** connections, CommandDest dest); +static int data_node_begin(int conn_count, DataNodeHandle ** connections, + GlobalTransactionId gxid); +static int data_node_commit(int conn_count, DataNodeHandle ** connections); +static int data_node_rollback(int conn_count, DataNodeHandle ** connections); static void clear_write_node_list(); @@ -920,7 +922,7 @@ FetchTuple(RemoteQueryState *combiner, TupleTableSlot *slot) /* * Handle responses from the Data node connections */ -static void +static int 
data_node_receive_responses(const int conn_count, DataNodeHandle ** connections, struct timeval * timeout, RemoteQueryState *combiner) { @@ -940,7 +942,8 @@ data_node_receive_responses(const int conn_count, DataNodeHandle ** connections, { int i = 0; - data_node_receive(count, to_receive, timeout); + if (data_node_receive(count, to_receive, timeout)) + return EOF; while (i < count) { int result = handle_response(to_receive[i], combiner); @@ -959,12 +962,17 @@ data_node_receive_responses(const int conn_count, DataNodeHandle ** connections, break; default: /* Inconsistent responses */ - ereport(ERROR, - (errcode(ERRCODE_INTERNAL_ERROR), - errmsg("Unexpected response from the data nodes, result = %d, request type %d", result, combiner->request_type))); + add_error_message(to_receive[i], "Unexpected response from the data nodes"); + elog(WARNING, "Unexpected response from the data nodes, result = %d, request type %d", result, combiner->request_type); + /* Stop tracking and move last connection in place */ + count--; + if (i < count) + to_receive[i] = to_receive[count]; } } } + + return 0; } /* @@ -990,6 +998,18 @@ handle_response(DataNodeHandle * conn, RemoteQueryState *combiner) if (conn->state == DN_CONNECTION_STATE_QUERY) return RESPONSE_EOF; + /* + * If we are in the process of shutting down, we + * may be rolling back, and the buffer may contain other messages. + * We want to avoid a procarray exception + * as well as an error stack overflow. + */ + if (proc_exit_inprogress) + { + conn->state = DN_CONNECTION_STATE_ERROR_FATAL; + return RESPONSE_EOF; + } + /* TODO handle other possible responses */ switch (get_message(conn, &msg_len, &msg)) { @@ -1005,10 +1025,17 @@ handle_response(DataNodeHandle * conn, RemoteQueryState *combiner) HandleCommandComplete(combiner, msg, msg_len); break; case 'T': /* RowDescription */ +#ifdef DN_CONNECTION_DEBUG + Assert(!conn->have_row_desc); + conn->have_row_desc = true; +#endif if (HandleRowDescription(combiner, msg, msg_len)) return RESPONSE_TUPDESC; break; case 'D': /* DataRow */ +#ifdef DN_CONNECTION_DEBUG + Assert(conn->have_row_desc); +#endif HandleDataRow(combiner, msg, msg_len); return RESPONSE_DATAROW; case 'G': /* CopyInResponse */ @@ -1042,6 +1069,9 @@ handle_response(DataNodeHandle * conn, RemoteQueryState *combiner) case 'Z': /* ReadyForQuery */ conn->transaction_status = msg[0]; conn->state = DN_CONNECTION_STATE_IDLE; +#ifdef DN_CONNECTION_DEBUG + conn->have_row_desc = false; +#endif return RESPONSE_COMPLETE; case 'I': /* EmptyQuery */ default: @@ -1058,7 +1088,8 @@ handle_response(DataNodeHandle * conn, RemoteQueryState *combiner) * Send BEGIN command to the Data nodes and receive responses */ static int -data_node_begin(int conn_count, DataNodeHandle ** connections, CommandDest dest, GlobalTransactionId gxid) +data_node_begin(int conn_count, DataNodeHandle ** connections, + GlobalTransactionId gxid) { int i; struct timeval *timeout = NULL; @@ -1078,7 +1109,8 @@ data_node_begin(int conn_count, DataNodeHandle ** connections, CommandDest dest, combiner->dest = None_Receiver; /* Receive responses */ - data_node_receive_responses(conn_count, connections, timeout, combiner); + if (data_node_receive_responses(conn_count, connections, timeout, combiner)) + return EOF; /* Verify status */ return ValidateAndCloseCombiner(combiner) ? 
0 : EOF; @@ -1109,12 +1141,12 @@ /* - * Commit current transaction, use two-phase commit if necessary + * Commit current transaction on data nodes where it has been started */ -int -DataNodeCommit(CommandDest dest) +void +DataNodeCommit(void) { - int res; + int res = 0; int tran_count; DataNodeHandle *connections[NumDataNodes]; @@ -1128,7 +1160,7 @@ if (tran_count == 0) goto finish; - res = data_node_commit(tran_count, connections, dest); + res = data_node_commit(tran_count, connections); finish: /* In autocommit mode statistics is collected in DataNodeExec */ @@ -1138,15 +1170,19 @@ finish: release_handles(); autocommit = true; clear_write_node_list(); - return res; + if (res != 0) + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Could not commit connection on data nodes"))); } /* - * Send COMMIT or PREPARE/COMMIT PREPARED down to the Data nodes and handle responses + * Commit the transaction on the specified data node connections, using + * two-phase commit if data has been modified on more than one node during the transaction. */ static int -data_node_commit(int conn_count, DataNodeHandle ** connections, CommandDest dest) +data_node_commit(int conn_count, DataNodeHandle ** connections) { int i; struct timeval *timeout = NULL; @@ -1191,13 +1227,12 @@ data_node_commit(int conn_count, DataNodeHandle ** connections, CommandDest dest combiner = CreateResponseCombiner(conn_count, COMBINE_TYPE_NONE); combiner->dest = None_Receiver; /* Receive responses */ - data_node_receive_responses(conn_count, connections, timeout, combiner); + if (data_node_receive_responses(conn_count, connections, timeout, combiner)) + result = EOF; /* Reset combiner */ if (!ValidateAndResetCombiner(combiner)) - { result = EOF; - } } if (!do2PC) @@ -1238,7 +1273,8 @@ data_node_commit(int conn_count, DataNodeHandle ** connections, CommandDest dest combiner->dest = None_Receiver; } /* Receive responses */ - data_node_receive_responses(conn_count, connections, timeout, combiner); + if (data_node_receive_responses(conn_count, connections, timeout, combiner)) + result = EOF; result = ValidateAndCloseCombiner(combiner) ? result : EOF; finish: @@ -1253,7 +1289,7 @@ finish: * Rollback current transaction */ int -DataNodeRollback(CommandDest dest) +DataNodeRollback(void) { int res = 0; int tran_count; @@ -1269,7 +1305,7 @@ if (tran_count == 0) goto finish; - res = data_node_rollback(tran_count, connections, dest); + res = data_node_rollback(tran_count, connections); finish: /* In autocommit mode statistics is collected in DataNodeExec */ @@ -1287,24 +1323,23 @@ finish: * Send ROLLBACK command down to the Data nodes and handle responses */ static int -data_node_rollback(int conn_count, DataNodeHandle ** connections, CommandDest dest) +data_node_rollback(int conn_count, DataNodeHandle ** connections) { int i; struct timeval *timeout = NULL; - int result = 0; RemoteQueryState *combiner; /* Send ROLLBACK - */ for (i = 0; i < conn_count; i++) { - if (data_node_send_query(connections[i], "ROLLBACK")) - result = EOF; + data_node_send_query(connections[i], "ROLLBACK"); } combiner = CreateResponseCombiner(conn_count, COMBINE_TYPE_NONE); combiner->dest = None_Receiver; /* Receive responses */ - data_node_receive_responses(conn_count, connections, timeout, combiner); + if (data_node_receive_responses(conn_count, connections, timeout, combiner)) + return EOF; /* Verify status */ return ValidateAndCloseCombiner(combiner) ? 
0 : EOF; @@ -1404,7 +1439,7 @@ DataNodeCopyBegin(const char *query, List *nodelist, Snapshot snapshot, bool is_ if (new_count > 0 && need_tran) { /* Start transaction on connections where it is not started */ - if (data_node_begin(new_count, newConnections, DestNone, gxid)) + if (data_node_begin(new_count, newConnections, gxid)) { pfree(connections); pfree(copy_connections); @@ -1448,8 +1483,8 @@ DataNodeCopyBegin(const char *query, List *nodelist, Snapshot snapshot, bool is_ combiner->dest = None_Receiver; /* Receive responses */ - data_node_receive_responses(conn_count, connections, timeout, combiner); - if (!ValidateAndCloseCombiner(combiner)) + if (data_node_receive_responses(conn_count, connections, timeout, combiner) + || !ValidateAndCloseCombiner(combiner)) { if (autocommit) { @@ -1665,6 +1700,12 @@ DataNodeCopyOut(Exec_Nodes *exec_nodes, DataNodeHandle** copy_connections, FILE* ereport(ERROR, (errcode(ERRCODE_CONNECTION_FAILURE), errmsg("unexpected EOF on datanode connection"))); + else + /* + * Set proper connection status - handle_response + * has changed it to DN_CONNECTION_STATE_QUERY + */ + handle->state = DN_CONNECTION_STATE_COPY_OUT; } /* There is no more data that can be read from connection */ } @@ -1746,7 +1787,7 @@ DataNodeCopyFinish(DataNodeHandle** copy_connections, int primary_data_node, combiner = CreateResponseCombiner(conn_count + 1, combine_type); combiner->dest = None_Receiver; - data_node_receive_responses(1, &primary_handle, timeout, combiner); + error = data_node_receive_responses(1, &primary_handle, timeout, combiner) || error; } for (i = 0; i < conn_count; i++) @@ -1786,30 +1827,14 @@ DataNodeCopyFinish(DataNodeHandle** copy_connections, int primary_data_node, combiner = CreateResponseCombiner(conn_count, combine_type); combiner->dest = None_Receiver; } - data_node_receive_responses(conn_count, connections, timeout, combiner); + error = (data_node_receive_responses(conn_count, connections, timeout, combiner) != 0) || error; processed = combiner->row_count; if (!ValidateAndCloseCombiner(combiner) || error) - { - if (autocommit) - { - if (need_tran) - DataNodeRollback(DestNone); - else - if (!PersistentConnections) release_handles(); - } - - return 0; - } - - if (autocommit) - { - if (need_tran) - DataNodeCommit(DestNone); - else - if (!PersistentConnections) release_handles(); - } + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Error while running COPY"))); return processed; } @@ -1882,9 +1907,17 @@ copy_slot(RemoteQueryState *node, TupleTableSlot *src, TupleTableSlot *dst) else { int i; + + /* + * Data node may be sending junk columns, which are always at the end, + * but it must not be shorter than the result slot. + */ + Assert(dst->tts_tupleDescriptor->natts <= src->tts_tupleDescriptor->natts); ExecClearTuple(dst); slot_getallattrs(src); - /* PGXCTODO revisit: probably incorrect */ + /* + * PGXCTODO revisit: is it correct to copy Datums using assignment? + */ for (i = 0; i < dst->tts_tupleDescriptor->natts; i++) { dst->tts_values[i] = src->tts_values[i]; @@ -1911,6 +1944,8 @@ ExecRemoteQuery(RemoteQueryState *node) EState *estate = node->ss.ps.state; TupleTableSlot *resultslot = node->ss.ps.ps_ResultTupleSlot; TupleTableSlot *scanslot = node->ss.ss_ScanTupleSlot; + bool have_tuple = false; + if (!node->query_Done) { @@ -1974,7 +2009,7 @@ else need_tran = !autocommit || total_conn_count > 1; - elog(DEBUG1, "autocommit = %s, has primary = %s, regular_conn_count = %d, need_tran = %s", autocommit ? 
"true" : "false", primarynode ? "true" : "false", regular_conn_count, need_tran ? "true" : "false"); + elog(DEBUG1, "autocommit = %s, has primary = %s, regular_conn_count = %d, statement_need_tran = %s", autocommit ? "true" : "false", primarynode ? "true" : "false", regular_conn_count, need_tran ? "true" : "false"); stat_statement(); if (autocommit) @@ -2052,7 +2087,7 @@ ExecRemoteQuery(RemoteQueryState *node) new_connections[new_count++] = connections[i]; if (new_count) - data_node_begin(new_count, new_connections, DestNone, gxid); + data_node_begin(new_count, new_connections, gxid); } /* See if we have a primary nodes, execute on it first before the others */ @@ -2088,36 +2123,22 @@ ExecRemoteQuery(RemoteQueryState *node) while (node->command_complete_count < 1) { - PG_TRY(); - { - data_node_receive(1, primaryconnection, NULL); - while (handle_response(primaryconnection[0], node) == RESPONSE_EOF) - data_node_receive(1, primaryconnection, NULL); - if (node->errorMessage) - { - char *code = node->errorCode; + if (data_node_receive(1, primaryconnection, NULL)) + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to read response from data nodes"))); + while (handle_response(primaryconnection[0], node) == RESPONSE_EOF) + if (data_node_receive(1, primaryconnection, NULL)) ereport(ERROR, - (errcode(MAKE_SQLSTATE(code[0], code[1], code[2], code[3], code[4])), - errmsg("%s", node->errorMessage))); - } - } - /* If we got an error response return immediately */ - PG_CATCH(); + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to read response from data nodes"))); + if (node->errorMessage) { - /* We are going to exit, so release combiner */ - if (autocommit) - { - if (need_tran) - DataNodeRollback(DestNone); - else if (!PersistentConnections) - release_handles(); - } - - pfree(primaryconnection); - pfree(connections); - PG_RE_THROW(); + char *code = node->errorCode; + ereport(ERROR, + (errcode(MAKE_SQLSTATE(code[0], code[1], code[2], code[3], code[4])), + errmsg("%s", node->errorMessage))); } - PG_END_TRY(); } pfree(primaryconnection); } @@ -2148,8 +2169,6 @@ ExecRemoteQuery(RemoteQueryState *node) } } - PG_TRY(); - { /* * Stop if all commands are completed or we got a data row and * initialized state node for subsequent invocations @@ -2158,7 +2177,10 @@ ExecRemoteQuery(RemoteQueryState *node) { int i = 0; - data_node_receive(regular_conn_count, connections, NULL); + if (data_node_receive(regular_conn_count, connections, NULL)) + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to read response from data nodes"))); /* * Handle input from the data nodes. * If we got a RESPONSE_DATAROW we can break handling to wrap @@ -2234,185 +2256,148 @@ ExecRemoteQuery(RemoteQueryState *node) } } } - } - /* If we got an error response return immediately */ - PG_CATCH(); - { - /* We are going to exit, so release combiner */ - if (autocommit) - { - if (need_tran) - DataNodeRollback(DestNone); - else if (!PersistentConnections) - release_handles(); - } - PG_RE_THROW(); - } - PG_END_TRY(); + node->query_Done = true; - node->need_tran = need_tran; } - PG_TRY(); + if (node->tuplesortstate) { - bool have_tuple = false; - - if (node->tuplesortstate) + while (tuplesort_gettupleslot((Tuplesortstate *) node->tuplesortstate, + true, scanslot)) { - while (tuplesort_gettupleslot((Tuplesortstate *) node->tuplesortstate, - true, scanslot)) + have_tuple = true; + /* + * If DISTINCT is specified and current tuple matches to + * previous skip it and get next one. 
+ * Otherwise return the current tuple + */ + if (step->distinct) { - have_tuple = true; /* - * If DISTINCT is specified and current tuple matches to - * previous skip it and get next one. - * Othervise return current tuple + * Always receive the very first tuple and + * skip to the next if the scan slot matches the previous (result slot) */ - if (step->distinct) + if (!TupIsNull(resultslot) && + execTuplesMatch(scanslot, + resultslot, + step->distinct->numCols, + step->distinct->uniqColIdx, + node->eqfunctions, + node->tmp_ctx)) { - /* - * Always receive very first tuple and - * skip to next if scan slot match to previous (result slot) - */ - if (!TupIsNull(resultslot) && - execTuplesMatch(scanslot, - resultslot, - step->distinct->numCols, - step->distinct->uniqColIdx, - node->eqfunctions, - node->tmp_ctx)) - { - have_tuple = false; - continue; - } + have_tuple = false; + continue; } - copy_slot(node, scanslot, resultslot); - (*node->dest->receiveSlot) (resultslot, node->dest); - break; } - if (!have_tuple) - ExecClearTuple(resultslot); + copy_slot(node, scanslot, resultslot); + (*node->dest->receiveSlot) (resultslot, node->dest); + break; } + if (!have_tuple) + ExecClearTuple(resultslot); + } + else + { + while (node->conn_count > 0 && !have_tuple) { - while (node->conn_count > 0 && !have_tuple) - { - int i; + int i; - /* - * If combiner already has tuple go ahead and return it - * otherwise tuple will be cleared - */ - if (FetchTuple(node, scanslot) && !TupIsNull(scanslot)) + /* + * If the combiner already has a tuple, go ahead and return it; + * otherwise the tuple will be cleared + */ + if (FetchTuple(node, scanslot) && !TupIsNull(scanslot)) + { + if (node->simple_aggregates) { - if (node->simple_aggregates) - { - /* - * Advance aggregate functions and allow to read up next - * data row message and get tuple in the same slot on - * next iteration - */ - exec_simple_aggregates(node, scanslot); - } - else - { - /* - * Receive current slot and read up next data row - * message before exiting the loop. Next time when this - * function is invoked we will have either data row - * message ready or EOF - */ - copy_slot(node, scanslot, resultslot); - (*node->dest->receiveSlot) (resultslot, node->dest); - have_tuple = true; - } + /* + * Advance the aggregate functions, then read the next + * data row message and get the tuple in the same slot on the + * next iteration + */ + exec_simple_aggregates(node, scanslot); } - - /* - * Handle input to get next row or ensure command is completed, - * starting from connection next after current. If connection - * does not - */ - if ((i = node->current_conn + 1) == node->conn_count) - i = 0; - - for (;;) + else { - int res = handle_response(node->connections[i], node); - if (res == RESPONSE_EOF) - { - /* go to next connection */ - if (++i == node->conn_count) - i = 0; - /* if we cycled over all connections we need to receive more */ - if (i == node->current_conn) - data_node_receive(node->conn_count, node->connections, NULL); - } - else if (res == RESPONSE_COMPLETE) - { - if (--node->conn_count == 0) - break; - if (i == node->conn_count) - i = 0; - else - node->connections[i] = node->connections[node->conn_count]; - if (node->current_conn == node->conn_count) - node->current_conn = i; - } - else if (res == RESPONSE_DATAROW) - { - node->current_conn = i; - break; - } + /* + * Receive current slot and read up next data row + * 
Next time when this + * function is invoked we will have either data row + * message ready or EOF + */ + copy_slot(node, scanslot, resultslot); + (*node->dest->receiveSlot) (resultslot, node->dest); + have_tuple = true; } } /* - * We may need to finalize aggregates + * Handle input to get next row or ensure command is completed, + * starting from connection next after current. If connection + * does not */ - if (!have_tuple && node->simple_aggregates) + if ((i = node->current_conn + 1) == node->conn_count) + i = 0; + + for (;;) { - finish_simple_aggregates(node, resultslot); - if (!TupIsNull(resultslot)) + int res = handle_response(node->connections[i], node); + if (res == RESPONSE_EOF) { - (*node->dest->receiveSlot) (resultslot, node->dest); - have_tuple = true; + /* go to next connection */ + if (++i == node->conn_count) + i = 0; + /* if we cycled over all connections we need to receive more */ + if (i == node->current_conn) + if (data_node_receive(node->conn_count, node->connections, NULL)) + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("Failed to read response from data nodes"))); + } + else if (res == RESPONSE_COMPLETE) + { + if (--node->conn_count == 0) + break; + if (i == node->conn_count) + i = 0; + else + node->connections[i] = node->connections[node->conn_count]; + if (node->current_conn == node->conn_count) + node->current_conn = i; + } + else if (res == RESPONSE_DATAROW) + { + node->current_conn = i; + break; } } - - if (!have_tuple) /* report end of scan */ - ExecClearTuple(resultslot); - } - if (node->errorMessage) + /* + * We may need to finalize aggregates + */ + if (!have_tuple && node->simple_aggregates) { - char *code = node->errorCode; - ereport(ERROR, - (errcode(MAKE_SQLSTATE(code[0], code[1], code[2], code[3], code[4])), - errmsg("%s", node->errorMessage))); + finish_simple_aggregates(node, resultslot); + if (!TupIsNull(resultslot)) + { + (*node->dest->receiveSlot) (resultslot, node->dest); + have_tuple = true; + } } - /* - * If command is completed we should commit work. - */ - if (node->conn_count == 0 && autocommit && node->need_tran) - DataNodeCommit(DestNone); + if (!have_tuple) /* report end of scan */ + ExecClearTuple(resultslot); + } - /* If we got an error response return immediately */ - PG_CATCH(); + + if (node->errorMessage) { - /* We are going to exit, so release combiner */ - if (autocommit) - { - if (node->need_tran) - DataNodeRollback(DestNone); - else if (!PersistentConnections) - release_handles(); - } - PG_RE_THROW(); + char *code = node->errorCode; + ereport(ERROR, + (errcode(MAKE_SQLSTATE(code[0], code[1], code[2], code[3], code[4])), + errmsg("%s", node->errorMessage))); } - PG_END_TRY(); return resultslot; } @@ -2436,7 +2421,7 @@ DataNodeCleanAndRelease(int code, Datum arg) /* Rollback on Data Nodes */ if (IsTransactionState()) { - DataNodeRollback(DestNone); + DataNodeRollback(); /* Rollback on GTM if transaction id opened. 
 */
 RollbackTranGTM((GlobalTransactionId) GetCurrentTransactionIdIfAny());

diff --git a/src/backend/pgxc/pool/poolcomm.c b/src/backend/pgxc/pool/poolcomm.c
index 4625261..7e4771c 100644
--- a/src/backend/pgxc/pool/poolcomm.c
+++ b/src/backend/pgxc/pool/poolcomm.c
@@ -22,7 +22,9 @@
 #include <errno.h>
 #include <stddef.h>
 #include "c.h"
+#include "postgres.h"
 #include "pgxc/poolcomm.h"
+#include "storage/ipc.h"
 #include "utils/elog.h"
 #include "miscadmin.h"
@@ -408,9 +410,16 @@ pool_flush(PoolPort *port)
 if (errno != last_reported_send_errno)
 {
 last_reported_send_errno = errno;
- ereport(ERROR,
- (errcode_for_socket_access(),
- errmsg("could not send data to client: %m")));
+
+ /*
+ * Handle a seg fault that may later occur in the proc array
+ * when this failure happens while we are already shutting down.
+ * If shutting down already, do not call ereport.
+ */
+ if (!proc_exit_inprogress)
+ ereport(ERROR,
+ (errcode_for_socket_access(),
+ errmsg("could not send data to client: %m")));
 }

 /*
diff --git a/src/backend/pgxc/pool/poolmgr.c b/src/backend/pgxc/pool/poolmgr.c
index 6427da3..dbb8aed 100644
--- a/src/backend/pgxc/pool/poolmgr.c
+++ b/src/backend/pgxc/pool/poolmgr.c
@@ -93,7 +93,7 @@ static DatabasePool *find_database_pool(const char *database);
 static DatabasePool *remove_database_pool(const char *database);
 static int *agent_acquire_connections(PoolAgent *agent, List *nodelist);
 static DataNodePoolSlot *acquire_connection(DatabasePool *dbPool, int node);
-static void agent_release_connections(PoolAgent *agent, bool clean);
+static void agent_release_connections(PoolAgent *agent, List *discard);
 static void release_connection(DatabasePool *dbPool, DataNodePoolSlot *slot, int index, bool clean);
 static void destroy_slot(DataNodePoolSlot *slot);
 static void grow_pool(DatabasePool *dbPool, int index);
@@ -587,7 +587,7 @@ agent_init(PoolAgent *agent, const char *database, List *nodes)
 /* disconnect if we still connected */
 if (agent->pool)
- agent_release_connections(agent, false);
+ agent_release_connections(agent, NULL);
 /* find database */
 agent->pool = find_database_pool(database);
@@ -612,7 +612,7 @@ agent_destroy(PoolAgent *agent)
 /* Discard connections if any remaining */
 if (agent->pool)
- agent_release_connections(agent, false);
+ agent_release_connections(agent, NULL);
 /* find agent in the list */
 for (i = 0; i < agentCount; i++)
@@ -700,11 +700,6 @@ static void
 agent_handle_input(PoolAgent * agent, StringInfo s)
 {
 int qtype;
- const char *database;
- int nodecount;
- List *nodelist = NIL;
- int *fds;
- int i;
 qtype = pool_getbyte(&agent->port);
 /*
@@ -712,6 +707,12 @@ agent_handle_input(PoolAgent * agent, StringInfo s)
 */
 for (;;)
 {
+ const char *database;
+ int nodecount;
+ List *nodelist = NIL;
+ int *fds;
+ int i;
+
 switch (qtype)
 {
 case 'c': /* CONNECT */
@@ -729,9 +730,7 @@ agent_handle_input(PoolAgent * agent, StringInfo s)
 pool_getmessage(&agent->port, s, 4 * NumDataNodes + 8);
 nodecount = pq_getmsgint(s, 4);
 for (i = 0; i < nodecount; i++)
- {
 nodelist = lappend_int(nodelist, pq_getmsgint(s, 4));
- }
 pq_getmsgend(s);
 /*
 * In case of error agent_acquire_connections will log
@@ -744,9 +743,13 @@ agent_handle_input(PoolAgent * agent, StringInfo s)
 pfree(fds);
 break;
 case 'r': /* RELEASE CONNECTIONS */
- pool_getmessage(&agent->port, s, 4);
+ pool_getmessage(&agent->port, s, 4 * NumDataNodes + 8);
+ nodecount = pq_getmsgint(s, 4);
+ for (i = 0; i < nodecount; i++)
+ nodelist = lappend_int(nodelist, pq_getmsgint(s, 4));
 pq_getmsgend(s);
- agent_release_connections(agent, true);
+ agent_release_connections(agent, nodelist);
+ list_free(nodelist);
 break;
 default: /* EOF or protocol violation */
 agent_destroy(agent);
@@ -831,11 +834,24 @@ agent_acquire_connections(PoolAgent *agent, List *nodelist)
 * Return connections back to the pool
 */
 void
-PoolManagerReleaseConnections(void)
+PoolManagerReleaseConnections(int ndisc, int* discard)
 {
+ uint32 n32;
+ uint32 buf[1 + ndisc];
+ int i;
+
 Assert(Handle);
- pool_putmessage(&Handle->port, 'r', NULL, 0);
+ n32 = htonl((uint32) ndisc);
+ buf[0] = n32;
+
+ for (i = 0; i < ndisc;)
+ {
+ n32 = htonl((uint32) discard[i++]);
+ buf[i] = n32;
+ }
+ pool_putmessage(&Handle->port, 'r', (char *) buf,
+ (1 + ndisc) * sizeof(uint32));
 pool_flush(&Handle->port);
 }
@@ -844,23 +860,40 @@ PoolManagerReleaseConnections(void)
 * Release connections
 */
 static void
-agent_release_connections(PoolAgent *agent, bool clean)
+agent_release_connections(PoolAgent *agent, List *discard)
 {
 int i;
+ DataNodePoolSlot *slot;
+
 if (!agent->connections)
 return;
- /* Enumerate connections */
- for (i = 0; i < NumDataNodes; i++)
+ if (discard)
 {
- DataNodePoolSlot *slot;
+ ListCell *lc;
+ foreach(lc, discard)
+ {
+ int node = lfirst_int(lc);
+ Assert(node > 0 && node <= NumDataNodes);
+ slot = agent->connections[node - 1];
+
+ /* Discard connection */
+ if (slot)
+ release_connection(agent->pool, slot, node - 1, false);
+ agent->connections[node - 1] = NULL;
+ }
+ }
+
+ /* Remaining connections are assumed to be clean */
+ for (i = 0; i < NumDataNodes; i++)
 {
 slot = agent->connections[i];
 /* Release connection */
 if (slot)
- release_connection(agent->pool, slot, i, clean);
+ release_connection(agent->pool, slot, i, true);
 agent->connections[i] = NULL;
 }
 }

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 553a682..ea66125 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4393,11 +4393,11 @@ pgxc_transaction_stmt (Node *parsetree)
 break;
 case TRANS_STMT_COMMIT:
- DataNodeCommit(DestNone);
+ DataNodeCommit();
 break;
 case TRANS_STMT_ROLLBACK:
- DataNodeRollback(DestNone);
+ DataNodeRollback();
 break;
 default:

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 077a589..75f1f41 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2875,12 +2875,16 @@ reversedirection_heap(Tuplesortstate *state)
 static unsigned int
 getlen_datanode(Tuplesortstate *state, int tapenum, bool eofOK)
 {
+ DataNodeHandle *conn = state->combiner->connections[tapenum];
 for (;;)
 {
- switch (handle_response(state->combiner->connections[tapenum], state->combiner))
+ switch (handle_response(conn, state->combiner))
 {
 case RESPONSE_EOF:
- data_node_receive(1, state->combiner->connections + tapenum, NULL);
+ if (data_node_receive(1, &conn, NULL))
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg(conn->error)));
 break;
 case RESPONSE_COMPLETE:
 if (eofOK)

diff --git a/src/include/pgxc/datanode.h b/src/include/pgxc/datanode.h
index ab95022..849d84a 100644
--- a/src/include/pgxc/datanode.h
+++ b/src/include/pgxc/datanode.h
@@ -50,6 +50,9 @@ struct data_node_handle
 /* Connection state */
 char transaction_status;
 DNConnectionState state;
+#ifdef DN_CONNECTION_DEBUG
+ bool have_row_desc;
+#endif
 char *error;
 /* Output buffer */
 char *outBuffer;
@@ -86,7 +89,7 @@ extern int data_node_send_query(DataNodeHandle * handle, const char *query);
 extern int data_node_send_gxid(DataNodeHandle * handle, GlobalTransactionId gxid);
 extern int data_node_send_snapshot(DataNodeHandle * handle, Snapshot snapshot);
-extern void data_node_receive(const int conn_count,
+extern int data_node_receive(const int conn_count,
 DataNodeHandle ** connections, struct timeval * timeout);
 extern int data_node_read_data(DataNodeHandle * conn);
 extern int send_some(DataNodeHandle * handle, int len);

diff --git a/src/include/pgxc/execRemote.h b/src/include/pgxc/execRemote.h
index d99806a..e9b59cc 100644
--- a/src/include/pgxc/execRemote.h
+++ b/src/include/pgxc/execRemote.h
@@ -62,7 +62,6 @@ typedef struct RemoteQueryState
 char errorCode[5]; /* error code to send back to client */
 char *errorMessage; /* error message to send back to client */
 bool query_Done; /* query has been sent down to data nodes */
- bool need_tran; /* auto commit on nodes after completion */
 char *completionTag; /* completion tag to present to caller */
 char *msg; /* last data row message */
 int msglen; /* length of the data row message */
@@ -81,8 +80,8 @@ typedef struct RemoteQueryState
 /* Multinode Executor */
 extern void DataNodeBegin(void);
-extern int DataNodeCommit(CommandDest dest);
-extern int DataNodeRollback(CommandDest dest);
+extern void DataNodeCommit(void);
+extern int DataNodeRollback(void);
 extern DataNodeHandle** DataNodeCopyBegin(const char *query, List *nodelist, Snapshot snapshot, bool is_from);
 extern int DataNodeCopyIn(char *data_row, int len, Exec_Nodes *exec_nodes, DataNodeHandle** copy_connections);

diff --git a/src/include/pgxc/poolmgr.h b/src/include/pgxc/poolmgr.h
index 2c9128e..b7ac3ae 100644
--- a/src/include/pgxc/poolmgr.h
+++ b/src/include/pgxc/poolmgr.h
@@ -45,7 +45,7 @@ typedef struct
 char *connstr;
 int freeSize; /* available connections */
 int size; /* total pool size */
- DataNodePoolSlot **slot;
+ DataNodePoolSlot **slot;
 } DataNodePool;
 /* All pools for specified database */
@@ -57,7 +57,7 @@ typedef struct databasepool
 struct databasepool *next;
 } DatabasePool;
-/* Agent of client session (Pool Manager side)
+/* Agent of client session (Pool Manager side)
 * Acts as a session manager, grouping connections together
 */
 typedef struct
@@ -125,6 +125,6 @@ extern void PoolManagerConnect(PoolHandle *handle, const char *database);
 extern int *PoolManagerGetConnections(List *nodelist);
 /* Return connections back to the pool */
-extern void PoolManagerReleaseConnections(void);
+extern void PoolManagerReleaseConnections(int ndisc, int* discard);
 #endif

-----------------------------------------------------------------------

Summary of changes:
 contrib/pgbench/pgbench.c | 6 +-
 src/backend/access/transam/xact.c | 32 ++-
 src/backend/pgxc/pool/datanode.c | 49 ++--
 src/backend/pgxc/pool/execRemote.c | 449 +++++++++++++++++-------------------
 src/backend/pgxc/pool/poolcomm.c | 15 +-
 src/backend/pgxc/pool/poolmgr.c | 71 ++++--
 src/backend/tcop/postgres.c | 4 +-
 src/backend/utils/sort/tuplesort.c | 8 +-
 src/include/pgxc/datanode.h | 5 +-
 src/include/pgxc/execRemote.h | 5 +-
 src/include/pgxc/poolmgr.h | 6 +-
 11 files changed, 347 insertions(+), 303 deletions(-)

hooks/post-receive -- Postgres-XC |
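A note for readers tracing the pool manager change above: the 'r' (RELEASE CONNECTIONS) message is no longer an empty payload but now carries (1 + ndisc) uint32 values in network byte order — a discard count followed by the 1-based index of each data node connection to drop — which is what PoolManagerReleaseConnections() packs with htonl() and agent_handle_input() reads back with pq_getmsgint(). The standalone sketch below is not Postgres-XC code; the pack/unpack helper names are invented for illustration and only mirror that framing:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <arpa/inet.h>  /* htonl / ntohl */

    /* Pack a discard list the way the 'r' payload does: count, then node indexes. */
    static uint32_t *pack_release_payload(int ndisc, const int *discard)
    {
        uint32_t *buf = malloc((1 + ndisc) * sizeof(uint32_t));
        int i;

        buf[0] = htonl((uint32_t) ndisc);
        for (i = 0; i < ndisc; i++)
            buf[i + 1] = htonl((uint32_t) discard[i]);
        return buf;
    }

    /* Decode the payload again, as the pooler side does with pq_getmsgint(). */
    static void unpack_release_payload(const uint32_t *buf)
    {
        int ndisc = (int) ntohl(buf[0]);
        int i;

        printf("discarding %d connection(s):", ndisc);
        for (i = 0; i < ndisc; i++)
            printf(" node %d", (int) ntohl(buf[i + 1]));
        printf("\n");
    }

    int main(void)
    {
        int discard[] = {2, 5};  /* 1-based data node indexes, sample values */
        uint32_t *payload = pack_release_payload(2, discard);

        unpack_release_payload(payload);
        free(payload);
        return 0;
    }

The asymmetry in the new agent_release_connections() is worth noting: connections named in the discard list are released with clean = false and dropped, while everything left in the agent's array goes back to the pool as clean for reuse, so the coordinator only has to enumerate the connections it considers tainted.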
From: Michael P <mic...@us...> - 2010-08-04 23:38:17
|
Project "Postgres-XC". The branch, master has been updated via 086c5c6be32d4ca9232523cd64caf6d29aaac42c (commit) from d7ca431066efe320107581186ab853b28fa5f7a7 (commit) - Log ----------------------------------------------------------------- commit 086c5c6be32d4ca9232523cd64caf6d29aaac42c Author: Michael P <mic...@us...> Date: Thu Aug 5 08:36:24 2010 +0900 Correction of bugs in pgxc_ddl reported by Bug report 3039166 in Source Forge Tracker. Those bugs were linked with string management problems in the script. diff --git a/src/bin/scripts/pgxc_ddl b/src/bin/scripts/pgxc_ddl index efc2f69..2442595 100644 --- a/src/bin/scripts/pgxc_ddl +++ b/src/bin/scripts/pgxc_ddl @@ -125,17 +125,17 @@ fi hosts=`cat $PGXC_CONF | grep coordinator_hosts | cut -d "'" -f 2` ports=`cat $PGXC_CONF | grep coordinator_ports | cut -d "'" -f 2` folders=`cat $PGXC_CONF | grep coordinator_folders | cut -d "'" -f 2` -if [ "hosts" = "" ] +if [ "$hosts" = "" ] then echo "coordinator_hosts not defined in pgxc.conf" exit 2 fi -if [ "ports" = "" ] +if [ "$ports" = "" ] then echo "coordinator_ports not defined in pgxc.conf" exit 2 fi -if [ "folders" = "" ] +if [ "$folders" = "" ] then echo "coordinator_folders not defined in pgxc.conf" exit 2 @@ -276,7 +276,7 @@ fi #Main process begins #Check if the database is defined, This could lead to coordinator being stopped uselessly -if [ $DB_NAME != "" ] +if [ "$DB_NAME" != "" ] then #Simply launch a fake SQL on the Database wanted $PSQL_CLIENT -h ${COORD_HOSTNAMES[$COORD_ORIG_INDEX]} -p ${COORD_PORTS[$COORD_ORIG_INDEX]} -c 'select now()' -d $DB_NAME; err=$? ----------------------------------------------------------------------- Summary of changes: src/bin/scripts/pgxc_ddl | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) hooks/post-receive -- Postgres-XC |