From: Sandeep G. <gup...@gm...> - 2013-11-14 16:17:12

Hi Koichi, Masataka,

Thanks for taking a look. I double-checked both the memory usage and the dmesg log. The kill is not happening due to out of memory; the kernel does not show any processes being killed.

My current suspicion is that something amiss is happening over the network. This happens consistently when performing COPY over large datasets. A few of the datanodes end up with "invalid string in message". Going over the code in pqformat.c, this happens because the invariant "StringInfo is guaranteed to have a trailing null byte" does not hold (line 682 -- 637). I am not sure why this happens. After this the datanodes do something illegal (probably a buffer overflow on the port -- just a guess) that makes the OS issue a SIGKILL. I am also not sure why the OS is not logging the kill.

Please let me know of any suggestions you may have regarding the "invalid string in message" error. In particular, where in the code does the StringInfo msg get populated, i.e. which code reads the message over the network and keeps it ready for the COPY operation on the datanode? Any other suggestions for tracing this bug would also be helpful.

Best,
Sandeep

On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito <pg...@gm...> wrote:
> Hi,
>
> I think I may be onto something.
>
> When the memory is exhausted, the Linux kernel kills processes by
> SIGKILL. The mechanism is called the OOM (Out Of Memory) killer, and
> it logs its activity to the kernel message buffer.
>
> Could you check the kernel messages (dmesg) and memory resources?
>
> Regards.
>
> On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta <gup...@gm...> wrote:
> > Hi Koichi,
> >
> > It is a bit of a mystery because it does not happen consistently.
> > Thanks for the clarification though. It is indeed helpful.
> >
> > -Sandeep
> >
> > On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 <ko...@in...> wrote:
> >>
> >> Someone is sending SIGKILL to coordinator/datanode backend processes.
> >> Although XC (originally PG) code has a handler for SIGKILL, I didn't
> >> find any code in XC sending SIGKILL to other processes. I'm afraid
> >> there could be another process sending SIGKILL to them.
> >>
> >> Best Regards;
> >> ---
> >> Koichi Suzuki
> >>
> >> On 2013/11/09 2:54, Sandeep Gupta <gup...@gm...> wrote:
> >>
> >> > I am running a single instance of postgres-xc with several datanodes.
> >> > Each datanode runs on its own machine.
> >> > After instantiating the cluster the database sits idle. I do not
> >> > perform any create table or insert operations. However, after some
> >> > time all the datanodes automatically shut down (log messages attached).
> >> >
> >> > Any pointers as to why this may be happening would be very useful.
> >> >
> >> > -Sandeep
> >> >
> >> > The log files across all the datanodes contain various error
> >> > messages. These include:
> >> >
> >> > LOG: statistics collector process (PID 23073) was terminated by signal 9: Killed
> >> > LOG: WAL writer process (PID 23071) was terminated by signal 9: Killed
> >> > LOG: terminating any other active server processes
> >> > WARNING: terminating connection because of crash of another server process
> >> > DETAIL: The postmaster has commanded this server process to roll back
> >> > the current transaction and exit, because another server process
> >> > exited abnormally and possibly corrupted shared memory.
> >> > HINT: In a moment you should be able to reconnect to the database
> >> > and repeat your command.
> >> > LOG: all server processes terminated; reinitializing
> >> >
> >> > LOG: checkpointer process (PID 21406) was terminated by signal 9: Killed
> >> > LOG: terminating any other active server processes
> >> > WARNING: terminating connection because of crash of another server process
> >> > DETAIL: The postmaster has commanded this server process to roll back
> >> > the current transaction and exit, because another server process
> >> > exited abnormally and possibly corrupted shared memory.
> >> > LOG: statistics collector process (PID 28881) was terminated by signal 9: Killed
> >> > LOG: autovacuum launcher process (PID 28880) was terminated by signal 9: Killed
> >> > LOG: terminating any other active server processes
> >> >
> >> > _______________________________________________
> >> > Postgres-xc-general mailing list
> >> > Pos...@li...
> >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
From: Masataka S. <pg...@gm...> - 2013-11-14 15:26:00

Hi,

I think I may be onto something.

When the memory is exhausted, the Linux kernel kills processes by SIGKILL. The mechanism is called the OOM (Out Of Memory) killer, and it logs its activity to the kernel message buffer.

Could you check the kernel messages (dmesg) and memory resources?

Regards.

On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta <gup...@gm...> wrote:
> Hi Koichi,
>
> It is a bit of a mystery because it does not happen consistently.
> Thanks for the clarification though. It is indeed helpful.
>
> -Sandeep
>
> On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 <ko...@in...> wrote:
>>
>> Someone is sending SIGKILL to coordinator/datanode backend processes.
>> Although XC (originally PG) code has a handler for SIGKILL, I didn't
>> find any code in XC sending SIGKILL to other processes. I'm afraid
>> there could be another process sending SIGKILL to them.
>>
>> Best Regards;
>> ---
>> Koichi Suzuki
>>
>> On 2013/11/09 2:54, Sandeep Gupta <gup...@gm...> wrote:
>>
>> > I am running a single instance of postgres-xc with several datanodes.
>> > Each datanode runs on its own machine.
>> > After instantiating the cluster the database sits idle. I do not
>> > perform any create table or insert operations. However, after some
>> > time all the datanodes automatically shut down (log messages attached).
>> >
>> > Any pointers as to why this may be happening would be very useful.
>> >
>> > -Sandeep
>> >
>> > The log files across all the datanodes contain various error
>> > messages. These include:
>> >
>> > LOG: statistics collector process (PID 23073) was terminated by signal 9: Killed
>> > LOG: WAL writer process (PID 23071) was terminated by signal 9: Killed
>> > LOG: terminating any other active server processes
>> > WARNING: terminating connection because of crash of another server process
>> > DETAIL: The postmaster has commanded this server process to roll back
>> > the current transaction and exit, because another server process
>> > exited abnormally and possibly corrupted shared memory.
>> > HINT: In a moment you should be able to reconnect to the database
>> > and repeat your command.
>> > LOG: all server processes terminated; reinitializing
>> >
>> > LOG: checkpointer process (PID 21406) was terminated by signal 9: Killed
>> > LOG: terminating any other active server processes
>> > WARNING: terminating connection because of crash of another server process
>> > DETAIL: The postmaster has commanded this server process to roll back
>> > the current transaction and exit, because another server process
>> > exited abnormally and possibly corrupted shared memory.
>> > LOG: statistics collector process (PID 28881) was terminated by signal 9: Killed
>> > LOG: autovacuum launcher process (PID 28880) was terminated by signal 9: Killed
>> > LOG: terminating any other active server processes
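[Editor's note] Masataka's check can be run as below. On the affected host you would simply run `dmesg | grep -iE 'out of memory|killed process'` (as root if needed) and look at `free -m`; since kernel wording varies by version, the sketch instead filters a hypothetical sample line (the PID 23071 is borrowed from the WAL writer log quoted in this thread) so the pattern itself can be demonstrated:

```shell
# On the real machine:  dmesg | grep -iE 'out of memory|killed process'
# The sample below imitates typical OOM-killer log entries; exact
# formatting differs across kernel versions.
sample='Out of memory: Kill process 23071 (postgres) score 812 or sacrifice child
Killed process 23071 (postgres) total-vm:412344kB, anon-rss:398120kB'

# Count matching lines case-insensitively; a nonzero count on the real
# host means the OOM killer was active.
printf '%s\n' "$sample" | grep -icE 'out of memory|killed process'
```

Sandeep's follow-up above reports that this check came back clean, which is what shifted suspicion toward the network path.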
From: Casper, S. <Ste...@se...> - 2013-11-14 09:23:45

Hello,

We want to use a postgres-xc cluster with two or more master nodes. If we start the cluster configuration with a GTM, two GTM proxies, two master coordinators and two master datanodes, we can initialize and use it.

Now we want to add or remove datanodes while the database is still running. If we add a new node, we add it to the coordinators' node list. The datanode then appears to know the database schema, but it does not contain any data, and it is also in write-protection mode. If we remove an existing datanode (shut down its KVM and remove it from the coordinators' node list), the database is no longer able to execute queries.

We are using Postgres-XC 1.1 beta. We would like to know the right way to add new master nodes to a running cluster, and how we can use the cluster after one of the nodes has crashed.

Best regards,
Stephan Casper
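[Editor's note] For reference, Postgres-XC manages cluster membership through node catalog commands issued on each coordinator. The fragment below is a sketch only: the node name, host, and port are hypothetical, existing tables are not redistributed automatically, and the exact syntax (in particular ALTER TABLE ... ADD NODE / DELETE NODE, which appeared around XC 1.1) should be verified against the Postgres-XC 1.1 reference for your build:

```sql
-- Register a new datanode "dn3" on every coordinator, then refresh
-- the connection pools so existing sessions can see it.
CREATE NODE dn3 WITH (TYPE = 'datanode', HOST = 'node3.example.com', PORT = 15432);
SELECT pgxc_pool_reload();

-- Redistribution is per table; the new node holds no data until then.
ALTER TABLE my_table ADD NODE (dn3);

-- To retire a node, move data off it first, then drop it.
ALTER TABLE my_table DELETE NODE (dn3);
DROP NODE dn3;
SELECT pgxc_pool_reload();
```

This would need to be run on every coordinator, since the node catalog is per coordinator; it does not by itself answer the crash-recovery half of Stephan's question.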