From: 鈴木 幸市 <ko...@in...> - 2013-12-12 01:01:11
We made changes in internal snapshot handling to solve another problem around last December to January. It would be very helpful if you could try commits from before and after this period; the changes made then are the most likely suspect.

Best;
---
Koichi Suzuki
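A minimal sketch of how one might walk the history Koichi describes, assuming a local clone of the postgres-xc git tree; the date window, the "master" ref, and the procarray.c path (inherited from PostgreSQL's source layout) are assumptions to adapt to the actual repository. The known-good commit hash is the one Mason cites below.

# list commits in the suspect window that touched snapshot handling
git log --since=2012-12-01 --until=2013-01-31 --oneline -- src/backend/storage/ipc/procarray.c

# bisect between a known-good commit and the failing tip,
# rebuilding and rerunning the failing CREATE INDEX at each step
git bisect start
git bisect bad master
git bisect good 11339220012a9e73cb82039b0ad41afd71bafca2
# after each test run, mark the step:
git bisect good    # or: git bisect bad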
From: Sandeep G. <gup...@gm...> - 2013-12-12 00:19:41
Hi Mason,

Thank you so much for taking the time. We are using pgxc 1.1, which was the stable release. Let me give it a try with commits from previous versions. It may take some time; I will get back to you with an update.

-Sandeep
From: Mason S. <ms...@tr...> - 2013-12-11 16:55:39
Perhaps I can help, Sandeep. Before working out how to transfer such a large file, please first try to pinpoint the particular commit that introduced the issue.

I recently reproduced a problem with VACUUM FULL in XC 1.1 (stop the cluster, restart it, execute VACUUM FULL). I could not reproduce the problem, however, with a build from commit 11339220012a9e73cb82039b0ad41afd71bafca2 of Aug 22, 2012. I suspect that a later commit to procarray.c may have something to do with the problem, and it may be a similar issue for you. If you have the time, perhaps you can try different commits based on the git log of procarray.c to see whether any particular one introduced the issue.

I would build the entire source tree at each commit, not just procarray.c. If you find a suspect commit from procarray.c's history, try the commit immediately preceding it for the whole source tree to confirm that the problem is procarray-related.

Again, this may all be unrelated, but I think it is a good place to start, so I apologize in advance if this chews up some of your time unnecessarily.

--
Mason Sharp
TransLattice - http://www.translattice.com
Distributed and Clustered Database Solutions
From: Michael P. <mic...@gm...> - 2013-12-11 04:23:54
On Wed, Dec 11, 2013 at 1:17 PM, Sandeep Gupta <gup...@gm...> wrote:
> I can provide the table schema and the data over which indexing almost always fails with a "tuple not found" error. Would this be of help? The other issue is that the file is 3.2 GB, so we would have to work out some logistics to transfer it.

Transferring a data file of a couple of gigabytes is out of the question. My point was to ask whether you are able to create a self-contained test case using automatically generated data, something like this:

create table foo as select generate_series(1,100000000) as a, 'bbbb'::text as b;
create index fooi on foo(a);

This way you wouldn't need to 1) publish your schema or 2) transfer huge files of data, and it would make tracking this error somewhat easier.

Regards,
--
Michael
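A fuller, hedged version of that sketch: the DISTRIBUTE BY clause, the row count, and the comment on where it fails are assumptions chosen to mirror the reported workload, not a confirmed reproducer.

DROP TABLE IF EXISTS foo;
-- the distribution clause is illustrative; match it to the failing table's layout
CREATE TABLE foo (a int, b text) DISTRIBUTE BY HASH (a);
INSERT INTO foo SELECT g, 'bbbb' FROM generate_series(1, 100000000) AS g;
-- the step that reportedly fails with "tuple not found" while autovacuum runs
CREATE INDEX fooi ON foo (a);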
From: Sandeep G. <gup...@gm...> - 2013-12-11 04:17:24
Hi Michael,

I can provide the table schema and the data over which indexing almost always fails with the "tuple not found" error. Would this be of help? The other issue is that the file is 3.2 GB, so we would have to work out some logistics to transfer it. Let me know.

-Sandeep
From: Koichi S. <koi...@gm...> - 2013-12-11 03:50:18
In 1.0, we added new APIs to GTM so that vacuum can run with a global XID and snapshot. We may need more improvements to use this. It would be wonderful if Mason provided a patch to fix this.

Regards;
---
Koichi Suzuki
From: Michael P. <mic...@gm...> - 2013-12-11 03:23:05
On Tue, Dec 10, 2013 at 11:00 PM, Mason Sharp <ms...@tr...> wrote:
> In our StormDB fork (now TransLattice Storm) I made some changes to address some issues that were uncovered with XC. I am not sure if it will address this specific issue, but in most cases we make it an error instead of falling back to a local XID like XC does (imagine if a node cannot reach GTM and autovacuum starts cleaning up data with local XIDs and snapshots).

Yep, falling back to a local XID when GTM is not reachable has been done since the beginning of the project. If we consider that a bug, on the argument that it endangers data visibility, such a patch should be back-patched as well. Some insight on those remarks from the core team would be welcome, though.

> Also, we use GTM for getting XIDs for authentication and for the autovacuum launcher, because in concurrency testing not doing so results in the same XID being consumed by other sessions and causing hanging and other transaction problems. The bottom line is that falling back to local XIDs and snapshots should almost always be avoided (initdb is OK).

Check.

> Our code is a bit different from vanilla XC, but I can try to put together a similar patch soon.

This would be welcome.

> As a community I feel we should prioritize testing and bug fixing, like the reported issue and replicated table handling, over new features like the merged coordinator and datanode project.

Definitely; normal developers cannot afford to spend so much time on projects as big as that. One of the big things I see missing is a public instance of an XC buildfarm, for example using the buildfarm code of Postgres, which simply fetches the code from git and kicks off in-core tests. For XC this should be restricted to regression tests and compilation; the pg_upgrade and isolation tests are not really working...

Regards,
--
Michael
From: 鈴木 幸市 <ko...@in...> - 2013-12-11 01:29:12
Sorry for the late response. I suspect that you're using setval() for all the DML. If so, I advise using nextval() instead, with the sequence's increment set to whatever step suits your workload best; or, if you stay with setval(), you should set the increment to the most common value you use. This decreases the chance that GTM decides the sequence value has been incremented too much and has to back up the next value.

Good luck;
---
Koichi Suzuki

On 2013/11/30 0:31, Matej Jellus <je...@ts...> wrote:

Hello, we solved the problem with GTM: when we deleted the sequence, it stopped writing extra data to the log file. But the volume of inserted data per minute is still very small.

On 27.11.2013 10:52, Koichi Suzuki wrote:

This could be a cause. Are you using sequences heavily? GTM backs up its restart point every 2000 updates of the transaction ID or of each sequence's value.

On 27.11.2013, Matej Jellus wrote:

The value on the first line is increasing really slowly; the sequence is increasing fast. gtm.control:

prev : 727618 gps.public.ai_value\00 30957 1 1 1 9223372036854775806 f t 1
next : 727618 gps.public.ai_value\00 30958 1 1 1 9223372036854775806 f t 1

On 27.11.2013 10:33, Koichi Suzuki wrote:

Could you visit GTM's data directory and watch the file gtm.control? The first line is the current restart point of the GXID value; if you watch it periodically, it shows how many transactions are coming to GTM. If the value does not increase much, please look at the other lines, which hold the current values of the sequences. These tell how often GTM has to back itself up. If not, there could be something wrong in GTM.

On 27.11.2013, Matej Jellus wrote:

Hello, I don't understand how it can create so many transactions. If I start phppgadmin, create a non-partitioned table, and insert one row, nothing is written to the log file. If I create a partitioned table and insert a row, it writes "Saving transaction restoration info, backed-up gxid" to the log file. If I send "INSERT INTO <table> VALUES (...), (...), (...);" it writes that line three times. I don't know how this is related to the partitioning, but it surely has dependencies.

On 25.11.2013 8:47, Koichi Suzuki wrote:

Your experiment says that your application created more than hundreds of thousands of transactions in a second. A write to GTM is done once every two thousand transactions. I'm wondering how this happens and how it is related to the partitioning.

On 20.11.2013, Matej Jellus wrote:

Hello, is it a problem of configuration, or is it all bad? Should we wait for some answers?

On 11.11.2013 15:28, Matej Jellus wrote:

Hello, here is the process. When we created the table with DISTRIBUTE BY REPLICATION TO NODE ( datanode1 ), it inserted maybe 100,000 rows per minute. When we created it with DISTRIBUTE BY REPLICATION TO NODE ( datanode1, datanode2 ), it inserted maybe 50,000 rows per minute. When we created it with DISTRIBUTE BY MODULO ( cluster_id ) TO NODE ( datanode1, datanode2 ) and wrote to only one node, it inserted 170,000 rows per minute. But when we created it with partitioning, it dropped to 5,000 rows per minute.

When we insert a row into a non-partitioned table, no line is generated in the GTM log file. When we insert a row into a partitioned table, an extra line is written to the GTM log: "Saving transaction restoration info, backed-up gxid: 409215 LOCATION: GTM_WriteRestorePointXid, gtm_txn.c:2649". Deleting or updating in a non-partitioned table goes quickly, but the same operation on a partitioned table is 3 to 4 times slower.

Below is the partitioning setup. We are importing data into the database, so we first insert a unit_id into the table units. This generates a schema for the year we are inserting, based on the column gps_timestamp; after that it creates a table for the unit in that schema, which inherits from the primary table units_data. Then we insert only into units_data.

CREATE TABLE data.units (
  unit_id uuid NOT NULL,
  cluster_id INT NOT NULL
);

CREATE SEQUENCE ai_value;

CREATE TABLE data.units_data (
  row_id INTEGER NOT NULL DEFAULT nextval('ai_value'),
  unit_id uuid NOT NULL,
  cluster_id INT NOT NULL,
  gps_timestamp TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  gps_lat DOUBLE PRECISION NOT NULL,
  gps_lng DOUBLE PRECISION NOT NULL,
  others hstore,
  PRIMARY KEY ( unit_id, cluster_id )
);

CREATE OR REPLACE FUNCTION generate_all(IN cluster_id INTEGER, IN unit_id UUID) RETURNS VOID AS $$
DECLARE
  ROK VARCHAR(4);
  DATANODE_ID VARCHAR(1);
  NAZOV_SCHEMY VARCHAR(13);
  NAZOV_TABULKY VARCHAR;
  NAZOV_END_TABULKY VARCHAR;
  NAZOV_ROK_TABULKY VARCHAR;
BEGIN
  -- SELECT 2013 INTO ROK;
  ROK := 2013;
  SELECT cluster_id INTO DATANODE_ID;
  NAZOV_SCHEMY := 'node_'||DATANODE_ID||'_y_'||ROK;
  NAZOV_ROK_TABULKY := 'units_data_2013';
  NAZOV_TABULKY := 'u_'||replace(unit_id::text,'-','');
  NAZOV_END_TABULKY := NAZOV_TABULKY||'_m';
  -- create the schema and tables
  IF NOT EXISTS(SELECT * FROM information_schema.schemata WHERE schema_name=NAZOV_SCHEMY) THEN
    EXECUTE 'CREATE SCHEMA '||NAZOV_SCHEMY||';';
  END IF;
  IF NOT EXISTS(SELECT * FROM information_schema.tables WHERE table_name=NAZOV_ROK_TABULKY AND table_schema=NAZOV_SCHEMY) THEN
    EXECUTE 'CREATE TABLE '||NAZOV_SCHEMY||'.'||NAZOV_ROK_TABULKY||'( CHECK ( date_part(''year''::text, gps_timestamp)=2013 ) ) INHERITS ( data.units_data );';
  END IF;
  EXECUTE 'CREATE OR REPLACE RULE "year2013_node'||DATANODE_ID||'_choice" AS ON INSERT TO data.units_data WHERE date_part(''year''::text, NEW.gps_timestamp)=2013 AND NEW.cluster_id='||DATANODE_ID||' DO INSTEAD INSERT INTO '||NAZOV_SCHEMY||'.units_data_2013 VALUES(NEW.*);';
  EXECUTE 'CREATE TABLE '||NAZOV_SCHEMY||'.'||NAZOV_TABULKY||'( CHECK ( unit_id='''||unit_id||'''::uuid ) ) INHERITS ('||NAZOV_SCHEMY||'.'||NAZOV_ROK_TABULKY||' );';
  EXECUTE 'CREATE RULE "'||NAZOV_TABULKY||'_choice" AS ON INSERT TO '||NAZOV_SCHEMY||'.units_data_2013 WHERE unit_id='''||unit_id||''' DO INSTEAD INSERT INTO '||NAZOV_SCHEMY||'.'||NAZOV_TABULKY||' VALUES(NEW.*);';
  RETURN;
END
$$ LANGUAGE plpgsql;

CREATE RULE "generate_all_rule" AS ON INSERT TO data.units DO ALSO SELECT generate_all(NEW.cluster_id, NEW.unit_id);

On 11.11.2013 10:11, Koichi Suzuki wrote:

This does not show anything special. As I wrote, the messages in question should not appear so often. Could you test how many of these messages are written in another case, for example without partitioning? I think the partitioned table has nothing to do with the issue, but I would like to know whether the case happens with other DMLs and DDLs. Thank you.

On 2013/11/11 17:49, Matej Jellus wrote:

Our situation: two machines. The first runs the primary GTM, a coordinator, and a datanode; the second runs the standby GTM, a coordinator, and a datanode. GTM configuration below. We are inserting 5,000 rows per minute, which is too low for us; we need to insert 100,000 rows per minute.

Starting the primary GTM: gtm_ctl -Z gtm start -D /var/pgxc/data_gtm
Starting the standby GTM: gtm_ctl -Z gtm start -D /var/pgxc/data_gtm
Starting a coordinator: pg_ctl start -D /var/pgxc/ubuntu2_coord -Z coordinator -l /tmp/logfile_ubuntu2_coord

Tables in the database use partitioning, so we have one primary table and create child tables from it. For example:

primary table: schema data, table units_data
child tables:
  schema y_2013_node1 (rows stored on node 1): table u_<unit_id>, child of data.units_data
  schema y_2013_node2 (rows stored on node 2): table u_<unit_id>, child of data.units_data

On 11.11.2013 7:49, Koichi Suzuki wrote:

Message 1 will be written once every 2000 transactions; the number you report looks too large. Message 1 is written in three cases: 1) when GTM starts, 2) once every 2000 transactions, 3) when a GTM standby is promoted to master. Could you let me know how you start your cluster? We have tested XC with DBT-1 and DBT-2 but did not see such frequent writes at GTM. You are inserting 5,000 rows per minute, so case 2) should happen only two or three times a minute. Message 2 is written when a GTM standby connects to GTM. Are you using a GTM standby? If so, how are you starting it? We are testing XC in a different environment but did not see such GTM overload; I am interested in the situation.

On 2013/11/09 0:06, Matej Jellus wrote:

Hello, we have a problem: Postgres-XC is very slow, and GTM is writing a lot of data to its log. Can this be disabled? The most repeated lines are:

1:140130870839040:2013-11-08 15:42:49.357 CET -LOG: Saving transaction restoration info, backed-up gxid: 373106 LOCATION: GTM_WriteRestorePointXid, gtm_txn.c:2649
(this appears maybe 130 times per second)

1:140130879186688:2013-11-08 15:45:37.572 CET -LOG: Connection established with GTM standby. - 0x1f91398 LOCATION: gtm_standby_connect_to_standby_int, gtm_standby.c:400

It is overloading the disk; I/O is at 70%. Right now it handles 5,000 inserts per minute, and we need more.

Thank you, best regards
Matej
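A minimal sketch of the sequence tuning Koichi suggests, applied to Matej's ai_value sequence. The step and cache figures are illustrative, and whether caching reduces GTM round-trips in XC exactly as session caching does in stock PostgreSQL is an assumption:

-- hand out values in larger strides so GTM has to back up
-- its restart point less often (step size is illustrative)
ALTER SEQUENCE ai_value INCREMENT BY 50;

-- alternatively, keep the step at 1 but pre-allocate a block
-- of values per session (cache size is illustrative)
ALTER SEQUENCE ai_value CACHE 1000;

Either way, the trade-off is gaps in row_id: values skipped inside a stride or left unused in a cached block are never handed out again.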
From: Mason S. <ms...@tr...> - 2013-12-10 14:00:09
On Mon, Dec 9, 2013 at 8:49 PM, Michael Paquier <mic...@gm...> wrote:
> This smells like a concurrency issue with autovacuum on the XC side. [...]

In our StormDB fork (now TransLattice Storm) I made some changes to address issues that were uncovered with XC. I am not sure whether they address this specific issue, but in most cases we raise an error instead of falling back to a local XID as XC does (imagine a node that cannot reach GTM while autovacuum starts cleaning up data with local XIDs and snapshots).

Also, we use GTM to get XIDs for authentication and for the autovacuum launcher, because in concurrency testing not doing so resulted in the same XID being consumed by other sessions, causing hangs and other transaction problems. The bottom line is that falling back to local XIDs and snapshots should almost always be avoided (initdb is OK).

Our code is a bit different from vanilla XC, but I can try to put together a similar patch soon.

As a community, I feel we should prioritize testing and bug fixing (such as the reported issue and replicated-table handling) over new features like the merged coordinator and datanode project.

Mason Sharp
TransLattice - http://www.translattice.com
Distributed and Clustered Database Solutions
From: Michael P. <mic...@gm...> - 2013-12-10 01:49:38
This smells like a concurrency issue with autovacuum on the XC side. I recall fixing past issues where autovacuum did not take a correct snapshot from GTM in certain code paths, endangering data consistency in the cluster because autovacuum might clean more tuples than it should. Another possible explanation for this bug is the way RecentGlobalXmin is computed for autovacuum from the GTM snapshots, which would explain why autovacuum cleaned away some tuples it should not have, making a failure more likely for long-running transactions.

Those are assumptions, though. It would be great if you could provide a self-contained test case, say a table whose data is generated with generate_series. Judging by the spec of the machine you are using, I am sure I would not be able to reproduce this on my laptop, but the core team has access to more powerful machines.

Also: Postgres-XC 1.1.0 is based on PostgreSQL 9.2.4.
--
Michael
From: Sandeep G. <gup...@gm...> - 2013-12-09 22:17:47
Hello all,

We are trying to trace the cause of, and a potential solution for, the "tuple not found" error with postgres-xc. The problem happens when indexing a large file: it seems autovacuum locks certain cache pages that the indexer tries to read, and the indexing fails with "tuple not found".

I am not sure whether it qualifies as a Postgres or a postgres-xc error. I was just wondering what the recommended way is to go about fixing this. Turning off autovacuum is really not the best solution, because the system then runs into memory-usage errors.

Would greatly appreciate any pointers on this.

-Sandeep

On Mon, Dec 9, 2013 at 1:43 PM, Sandeep Gupta <gup...@gm...> wrote:
> Agreed. However, the "tuple not found" problem seems to happen with plain Postgres as well, if not in this particular case of index creation over large datasets. It would be helpful to know what the fixes are in that scenario and how to avoid it in the first place. Solutions and fixes for Postgres will carry over to pgxc, because the low-level machinery (blocks, files, indexes) in postgres-xc is really the same as in Postgres.
>
> On Mon, Dec 9, 2013 at 1:23 PM, John R Pierce <pi...@ho...> wrote:
>> On 12/9/2013 10:07 AM, RUSHI KAW wrote: "I have been running Postgresxc 1.1"
>>
>> You'd probably be best off finding the postgresql-xc list, as that is really a rather different system, even if it is forked from community postgresql.
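Not from the thread, but a narrower mitigation than switching autovacuum off cluster-wide: PostgreSQL's per-table storage parameters (present in the 9.2 code base that XC 1.1 builds on) can disable autovacuum for just the table being indexed. The table name is illustrative:

-- disable autovacuum only for the table whose index build fails
ALTER TABLE big_table SET (autovacuum_enabled = false);
-- ... run the large CREATE INDEX ...
ALTER TABLE big_table SET (autovacuum_enabled = true);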
From: Koichi S. <koi...@gm...> - 2013-12-09 05:45:27
So nice to finally find the cause. Hope your work goes smoothly then. Thank you;
---
Koichi Suzuki

On 2013/12/6, Sandeep Gupta <gup...@gm...> wrote:

Hi Koichi, we finally solved the mystery of the postgres processes being killed during COPY. It had nothing to do with postgres-xc: it turned out that a misconfiguration of our compute system was killing the processes. Sorry for the trouble. We have executed several runs with the COPY command and they are working fine so far.

On Mon, Nov 18, 2013 at 10:17 PM, Koichi Suzuki <koi...@gm...> wrote:

Thanks. I'll be looking forward to your test result. I talked with Masataka, and he was interested that no relevant messages were found in dmesg.

On 2013/11/19, Sandeep Gupta wrote:

Thanks for taking a look. I did some further debugging, and I won't rule out the possibility that the OS is killing some postgres processes, which leads the COPY execution into an inconsistent state. I will do further tests to confirm this.

On 2013/11/18, Koichi Suzuki wrote:

This happens when the message is not null-terminated properly; in most cases it is a bug. In this case, differently from vanilla PostgreSQL, copy data is involved. Please let me find time to look into it.

On 2013/11/16, Sandeep Gupta wrote:

Forgot to mention that the error is still the same, "invalid string in message".

On Fri, Nov 15, 2013 at 8:27 PM, Sandeep Gupta wrote:

We tried the COPY with the patch, but it did not fix the problem for us. If you need any additional details, let me know.

On Fri, Nov 15, 2013 at 6:13 AM, Sandeep Gupta wrote:

Many thanks for sharing the patch. We will test it in our environment and report back on the result. Regarding the very high memory consumption, it is not the coordinator that sees the high memory usage. The setup is the following: we have one coordinator and 4 datanodes, and all the datanodes run on separate machines from the coordinator. It is the machines where the datanodes run that show 48 GB of memory usage. The other caveat is that top doesn't show the datanode processes using the memory. In any case, the last run of COPY, even with the high usage, went through fine.

On Fri, Nov 15, 2013 at 2:29 AM, Koichi Suzuki wrote:

Yes, I've already found this kind of failure. When disk writes are much slower than reads, the coordinator overshoots. However, in this case I thought we would see palloc() overflow: if each piece of COPY TO text data is less than 1 GB, each COPY TO succeeds. If writes are much slower than reads, each coordinator allocates memory for each COPY TO cache going to the datanodes, ending up with an enormous amount of memory consumption and a good chance of triggering the OOM killer. Your memory usage looks enormous; maybe you have many COPY TO instances, each of which allocates too much memory.

The attached patch fixes COPY TO. Because we may have similar overshoots in COPY FROM, I'd like to write a patch to fix that too before committing. This is for the current master but will work with 1.1.x (hopefully even with 1.0.x). The patch looks fine; I asked for it to be tested in such an environment, and the testers reported that it fixed their issue. I'm not sure, but I hope this works in your case; it may reduce memory consumption drastically. Please let me know the result. I have to allocate some of my resources to fixing the same issue in COPY FROM. I have another request to fix this, so if the patch works for you, I may commit only the COPY TO fix first.

On 2013/11/15, Sandeep Gupta wrote:

Hi Koichi, I am using the COPY command over a text file to populate a table. Sorry for the confusion: I meant lines 628 -- 638 in pqformat.c. Let me explain a bit more. I get the error "invalid string in message" in some of the datanode logfiles; the message appears only in pqformat.c, at line 628, although I have not been able to trace back where the error first originated in the datanode. To debug this, I started printing out the rows the coordinator sends to the datanode (in the function DataNodeCopyIn); for some reason the "invalid string in message" error has not happened since. It seems there is some issue with the flow control of messages between coordinator and datanode. I hope this helps narrow down what is causing the problem in the first place. However, the new problem I see is that memory usage goes very high: almost all of the 48 GB gets used. Is this normal? If not, can you suggest how to keep memory consumption low?

On Thu, Nov 14, 2013 at 8:57 PM, Koichi Suzuki wrote:

Could you let me know what "line 682 -- 637" means? Because all communication between coordinator and datanode is done using libpq, I don't think XC populates information using a raw StringInfo; in fact, from the datanodes' point of view, the coordinator is just a client, except for supplying the XID and snapshot. However, COPY TO uses send()/receive(), not libpq, to send data to the datanodes. In that case, if the incoming data file contains NUL in the data, such data may be sent via send() to the datanodes; COPY TO assumes the incoming data is in text format. Do you think you have such a situation? The analysis may take a bit, but let me try.

On 2013/11/15, Sandeep Gupta wrote:

Hi Koichi, Masataka, thanks for taking a look. I double-checked both the memory usage and the dmesg log. The kill is not happening due to out-of-memory; the kernel doesn't show any processes being killed. My current suspicion is that something amiss is happening over the network. This happens consistently when performing COPY over large datasets: a few of the datanodes end up with "invalid string in message". Going over the code in pqformat.c, this happens because the invariant "StringInfo is guaranteed to have a trailing null byte" does not hold (lines 682 -- 637). I am not sure why this happens. After this, the datanodes do something illegal (probably a buffer overflow on the port, just a guess) such that the OS decides to issue a SIGKILL. I am also not sure why the OS is not logging the kill. Please let me know of any suggestions you may have regarding the "invalid string in message" error, in particular where in the code the StringInfo msg gets populated, i.e., which code reads the message over the network and keeps it ready for the copy operation on the datanode. Any other suggestions for tracing this bug would also be helpful.

On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito <pg...@gm...> wrote:

Hi, I think I may be onto something. When memory is exhausted, the Linux kernel kills processes by SIGKILL; the mechanism is called the OOM (Out Of Memory) killer, and it logs its activity to the kernel message buffer. Could you check the kernel messages (dmesg) and your memory resources?

On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta wrote:

Hi Koichi, it is a bit of a mystery because it does not happen consistently. Thanks for the clarification, though; it is indeed helpful.

On 2013/11/11, Koichi Suzuki <ko...@in...> wrote:

Someone is sending SIGKILL to the coordinator/datanode backend processes. Although the XC (originally PG) code has a handler for SIGKILL, I did not find any code in XC that sends SIGKILL to other processes. I'm afraid there could be another process sending SIGKILL to them.

On 2013/11/09 2:54, Sandeep Gupta wrote:

I am running a single instance of postgres-xc with several datanodes; each datanode runs on its own machine. After instantiating the cluster, the database sits idle: I do not perform any CREATE TABLE or INSERT operations. However, after some time all the datanodes shut down automatically (log messages attached). Any pointers as to why this may be happening would be very useful. The log files across the datanodes contain various error messages, including:

LOG: statistics collector process (PID 23073) was terminated by signal 9: Killed
LOG: WAL writer process (PID 23071) was terminated by signal 9: Killed
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing

LOG: checkpointer process (PID 21406) was terminated by signal 9: Killed
LOG: statistics collector process (PID 28881) was terminated by signal 9: Killed
LOG: autovacuum launcher process (PID 28880) was terminated by signal 9: Killed
LOG: terminating any other active server processes
From: Sandeep G. <gup...@gm...> - 2013-12-05 22:35:14
|
Hi Koichi,

We finally solved the mystery of the postgres processes being killed during copy. It had nothing to do with postgres-xc: it turns out that a misconfiguration of our compute system resulted in the processes being killed. Sorry for the trouble. We have executed several runs with the copy command and they are working fine so far.

-Sandeep

On Mon, Nov 18, 2013 at 10:17 PM, Koichi Suzuki <koi...@gm...> wrote:
> Thanks. I'll be looking forward to your test result. I talked with Masataka and he was interested that no relevant messages were found in dmesg.
>
> Regards; --- Koichi Suzuki
>
> 2013/11/19 Sandeep Gupta <gup...@gm...>:
>> Hi Koichi,
>>
>> Thanks for taking a look. I did some further debugging, and I won't rule out the possibility of the OS killing some postgres processes, which would lead the copy execution to reach an inconsistent state. Will do some further tests to confirm this as well.
>>
>> -Sandeep
>>
>> On Mon, Nov 18, 2013 at 1:06 AM, Koichi Suzuki <koi...@gm...> wrote:
>>> This happens when the message is not null-terminated properly. In most cases it is a bug. In this case, different from vanilla PostgreSQL, copy data is involved. Please let me find a time to look into it.
>>>
>>> Regards; --- Koichi Suzuki
>>>
>>> 2013/11/16 Sandeep Gupta <gup...@gm...>:
>>>> Forgot to mention that the error is still the same, "invalid string in message".
>>>>
>>>> -Sandeep
>>>>
>>>> On Fri, Nov 15, 2013 at 8:27 PM, Sandeep Gupta <gup...@gm...> wrote:
>>>>> Hi Koichi,
>>>>>
>>>>> We tried the copy with the patch, but it did not fix the problem for us. If you need any additional details, let me know.
>>>>>
>>>>> Best, -Sandeep
>>>>>
>>>>> On Fri, Nov 15, 2013 at 6:13 AM, Sandeep Gupta <gup...@gm...> wrote:
>>>>>> Hi Koichi,
>>>>>>
>>>>>> Many thanks for sharing the patch. We will test it in our environment and report back on the result.
>>>>>>
>>>>>> Regarding the very high memory consumption, it is not the coordinator that sees the high memory usage. The setup is the following: we have one coordinator and 4 datanodes. All the datanodes run on separate machines from the coordinator. It is the machines where the datanodes run that show the 48GB memory usage. The other caveat is that "top" doesn't show that it is the datanode processes using the memory.
>>>>>>
>>>>>> In any case, the last run of copy, even with the high usage, went through fine.
>>>>>>
>>>>>> Best, Sandeep
>>>>>>
>>>>>> On Fri, Nov 15, 2013 at 2:29 AM, Koichi Suzuki <koi...@gm...> wrote:
>>>>>>> Yes, I've already found this kind of failure. When the disk write is much slower than the read, the coordinator overshoots. However, in this case, I thought we would see palloc() overflow. If each piece of COPY TO text data is less than 1GB, then each COPY TO succeeds. If the write is much slower than the read, each coordinator will allocate memory for each COPY TO cache going to the datanodes, ending up with an enormous amount of memory consumption and a good chance of triggering the OOM killer. Your memory usage looks enormous. Maybe you have many COPY TO instances, each of which allocates too much memory.
>>>>>>>
>>>>>>> The attached patch fixes COPY TO. Because we have a chance of similar overshoots in COPY FROM, I'd like to write a patch to fix that too before committing. This is for the current master but will work with 1.1.x (hopefully even with 1.0.x).
>>>>>>>
>>>>>>> The patch looks to be fine. I asked for it to be tested in such an environment, and they reported that the patch fixed their issue.
>>>>>>>
>>>>>>> I'm not sure, but I hope this works in your case. It may reduce memory consumption drastically. Please let me know the result.
>>>>>>>
>>>>>>> I have to allocate some of my resources to fix the same issue in COPY FROM. I have another request to fix this, so if the patch works for you, I may be able to commit only the COPY TO fix first.
>>>>>>>
>>>>>>> Regards; --- Koichi Suzuki
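To make the overshoot Suzuki describes concrete, here is a minimal, hypothetical C sketch (not Postgres-XC's actual code; CopyBuf, HIGH_WATER, and copybuf_append are invented names) of the general idea behind such a fix: flush to the slow sink at a high-water mark, so the producer blocks on the write instead of the buffer growing without bound.

/*
 * Hypothetical sketch, not Postgres-XC source: bounded COPY buffering.
 * When the sink (a FILE * here, standing in for a datanode connection)
 * drains slowly, the producer blocks at a high-water mark instead of
 * letting the in-memory buffer grow without limit.
 */
#include <stdio.h>
#include <string.h>

#define HIGH_WATER (64 * 1024)      /* flush threshold; chosen arbitrarily */

typedef struct {
    char   data[HIGH_WATER];
    size_t len;
} CopyBuf;

/* Append one COPY data row; flush before the buffer would overflow. */
static int copybuf_append(CopyBuf *buf, const char *row, size_t n, FILE *sink)
{
    if (n > sizeof(buf->data))
        return -1;                  /* a single row larger than the buffer */
    if (buf->len + n > sizeof(buf->data)) {
        /* Blocking write: this is the backpressure on the producer. */
        if (fwrite(buf->data, 1, buf->len, sink) != buf->len)
            return -1;
        buf->len = 0;
    }
    memcpy(buf->data + buf->len, row, n);
    buf->len += n;
    return 0;
}

int main(void)
{
    CopyBuf buf = { .len = 0 };
    for (int i = 0; i < 100000; i++)
        copybuf_append(&buf, "1\tbbbb\n", 7, stdout);
    fwrite(buf.data, 1, buf.len, stdout);   /* final flush */
    return 0;
}

The failure mode described in the mail above corresponds to dropping that flush and instead growing the buffer whenever it fills: memory use then scales with the gap between read and write speed rather than with the fixed high-water mark.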
>>>>>>> 2013/11/15 Sandeep Gupta <gup...@gm...>:
>>>>>>>> Hi Koichi,
>>>>>>>>
>>>>>>>> I am using the copy command over a text file to populate a table.
>>>>>>>>
>>>>>>>> Sorry for the confusion. I meant lines 628 -- 638 in pqformat.c. Let me explain this a bit more. I get the error "invalid string in message" in some of the datanode logfiles. The message is mentioned only in pqformat.c, at line 628. I have not been able to trace back where the error first occurred in the datanode.
>>>>>>>>
>>>>>>>> In order to debug this, I started printing out the rows the coordinator is sending to the datanode (in the function DataNodeCopyIn). For some reason, the "invalid string in message" error has not happened since. It seems there is some issue with the flow control of messages between coordinator and datanode. Hope this helps in narrowing down what is causing the problem in the first place.
>>>>>>>>
>>>>>>>> However, the new problem I see is that the memory usage goes very high. Of the 48 GB, almost all of the memory is getting used. Is this normal? If not, can you give a suggestion as to how to keep the memory consumption low?
>>>>>>>>
>>>>>>>> -Sandeep
>>>>>>>>
>>>>>>>> On Thu, Nov 14, 2013 at 8:57 PM, Koichi Suzuki <koi...@gm...> wrote:
>>>>>>>>> Could you let me know what "line 682 -- 637" means?
>>>>>>>>>
>>>>>>>>> Because any communication between coordinator and datanode is done using libpq, I don't think XC populates information using a raw StringInfo. In fact, to the datanodes, the coordinator is just a client, except for supplying the XID and snapshot. However, COPY TO uses send()/receive(), not libpq, to send data to the datanodes. In this case, if the incoming data file contains NULs in the data, such data may be sent via send() to the datanodes. COPY TO assumes the incoming data is in a text format. Do you think you have such a situation?
>>>>>>>>>
>>>>>>>>> Analysis may take a bit, but let me try.
>>>>>>>>>
>>>>>>>>> Regards; --- Koichi Suzuki
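The invariant this thread keeps returning to is easy to state in code. Below is a simplified, self-contained sketch modeled loosely on pq_getmsgstring() in pqformat.c (MsgBuf and get_msg_string are invented names, not the actual source): a string inside a message is valid only if its terminating NUL falls strictly before the end of the message.

/*
 * Simplified sketch of the "invalid string in message" check, modeled
 * loosely on pq_getmsgstring() in pqformat.c (not the actual source).
 */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *data;
    int         len;     /* total bytes in the message */
    int         cursor;  /* current read offset */
} MsgBuf;

/* Return the next NUL-terminated string, or NULL if the invariant fails. */
static const char *get_msg_string(MsgBuf *msg)
{
    const char *str = msg->data + msg->cursor;
    /* strnlen never scans past the end of the message buffer. */
    size_t slen = strnlen(str, (size_t) (msg->len - msg->cursor));

    if (msg->cursor + (int) slen >= msg->len) {
        /* No terminating NUL inside the message: the datanode's error. */
        fprintf(stderr, "invalid string in message\n");
        return NULL;
    }
    msg->cursor += (int) slen + 1;   /* step past the NUL */
    return str;
}

int main(void)
{
    MsgBuf good = { "hello\0world\0", 12, 0 };
    MsgBuf bad  = { "no terminator", 13, 0 };   /* NUL lost in transit */

    printf("%s\n", get_msg_string(&good));      /* prints "hello" */
    if (get_msg_string(&bad) == NULL)
        printf("detected a corrupted message\n");
    return 0;
}

A message whose terminating NUL never arrives -- through truncation, or through raw binary data pushed down the text path as Suzuki describes -- fails the check and produces exactly this error.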
>>>>>>>>> 2013/11/15 Sandeep Gupta <gup...@gm...>:
>>>>>>>>>> Hi Koichi, Masataka,
>>>>>>>>>>
>>>>>>>>>> Thanks for taking a look. I double-checked both the memory usage and the dmesg log. The kill is not happening due to out-of-memory; the kernel doesn't show any processes being killed.
>>>>>>>>>>
>>>>>>>>>> My current suspicion is that something amiss is happening over the network. This happens consistently when performing copy over large datasets. A few of the datanodes end up with "invalid string in message". Going over the code in pqformat.c, this happens because the invariant "StringInfo is guaranteed to have a trailing null byte" does not hold (line 682 -- 637). I am not sure why this happens.
>>>>>>>>>>
>>>>>>>>>> After this, the datanodes do something illegal (probably a buffer overflow on the port -- just a guess) such that the OS decides to issue a SIGKILL. I am also not sure why the OS is not logging this kill.
>>>>>>>>>>
>>>>>>>>>> Please let me know of any suggestions you may have regarding the "invalid string in message" error. In particular, where in the code does the StringInfo msg get populated, i.e., which code reads the message over the network and keeps it ready for the copy operation on the datanode? Any other suggestions would also be helpful to trace this bug.
>>>>>>>>>>
>>>>>>>>>> Best, Sandeep
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito <pg...@gm...> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I think I may be onto something.
>>>>>>>>>>>
>>>>>>>>>>> When memory is exhausted, the Linux kernel kills processes, seemingly at random, with SIGKILL. The mechanism is called the OOM (Out Of Memory) killer. The OOM killer logs its activity to the kernel message buffer.
>>>>>>>>>>>
>>>>>>>>>>> Could you check the kernel messages (dmesg) and the memory resources?
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta <gup...@gm...> wrote:
>>>>>>>>>>>> Hi Koichi,
>>>>>>>>>>>>
>>>>>>>>>>>> It is a bit of a mystery because it does not happen consistently. Thanks for the clarification, though; it is indeed helpful.
>>>>>>>>>>>>
>>>>>>>>>>>> -Sandeep
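For reference, the "terminated by signal 9: Killed" lines quoted further down come from the postmaster reaping its child processes. A standalone POSIX illustration (not Postgres-XC source) of how a supervisor detects that a child was killed by SIGKILL:

/*
 * Standalone POSIX illustration (not Postgres-XC source): how a
 * supervising process such as the postmaster learns that a child
 * "was terminated by signal 9: Killed".
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        pause();                /* child: sit idle, like a quiet backend */
        _exit(0);
    }

    kill(pid, SIGKILL);         /* stand-in for the OOM killer */

    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("process (PID %d) was terminated by signal %d: %s\n",
               (int) pid, WTERMSIG(status), strsignal(WTERMSIG(status)));
    return 0;
}

This is also why the victim's own log stays silent, as noted in the thread: SIGKILL cannot be caught, so only the supervisor and the kernel's message buffer (dmesg) record the event.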
>>>>>>>>>>>> On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 <ko...@in...> wrote:
>>>>>>>>>>>>> Someone is sending SIGKILL to the coordinator/datanode backend processes. Although the XC (originally PG) code has a handler for SIGKILL, I didn't find any code in XC that sends SIGKILL to other processes. I'm afraid there could be another process sending SIGKILL to them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards; --- Koichi Suzuki
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2013/11/09 at 2:54, Sandeep Gupta <gup...@gm...> wrote:
>>>>>>>>>>>>>> I am running a single instance of postgres-xc with several datanodes. Each datanode runs on its own machine. After instantiating the cluster, the database sits idle; I do not perform any create table or insert operations. However, after some time all the datanodes automatically shut down (log messages attached).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any pointers as to why this may be happening would be very useful.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Sandeep
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The log files across all datanodes contain various error messages. These include:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> LOG: statistics collector process (PID 23073) was terminated by signal 9: Killed
>>>>>>>>>>>>>> LOG: WAL writer process (PID 23071) was terminated by signal 9: Killed
>>>>>>>>>>>>>> LOG: terminating any other active server processes
>>>>>>>>>>>>>> WARNING: terminating connection because of crash of another server process
>>>>>>>>>>>>>> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>>>>>>>>>>>>>> HINT: In a moment you should be able to reconnect to the database and repeat your command.
>>>>>>>>>>>>>> LOG: all server processes terminated; reinitializing
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> LOG: checkpointer process (PID 21406) was terminated by signal 9: Killed
>>>>>>>>>>>>>> LOG: terminating any other active server processes
>>>>>>>>>>>>>> WARNING: terminating connection because of crash of another server process
>>>>>>>>>>>>>> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>>>>>>>>>>>>>> LOG: statistics collector process (PID 28881) was terminated by signal 9: Killed
>>>>>>>>>>>>>> LOG: autovacuum launcher process (PID 28880) was terminated by signal 9: Killed
>>>>>>>>>>>>>> LOG: terminating any other active server processes
|