From: Sandeep G. <gup...@gm...> - 2013-11-15 14:13:25
|
Hi Koichi,

Many thanks for sharing the patch. We will test it in our environment and report back on the result.

Regarding the very high memory consumption: it is not the coordinator that sees the high memory usage. The setup is as follows. We have one coordinator and 4 datanodes. All the datanodes run on a separate machine than the coordinator. It is the machine where the datanodes are running that shows the 48 GB memory usage. The other caveat is that "top" doesn't show the datanode processes as the ones using the memory.

In any case, the last run of copy went through fine even with the high usage.

Best,
Sandeep
|
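One possible way to follow up on memory that "top" does not attribute to any process is to look at how much of it is kernel buffer and page cache, using /proc/meminfo on the datanode machine. The sketch below is only an illustration under that assumption; nothing in the thread confirms that the page cache is the cause. The helper names `parse_meminfo` and `summarize` and the sample figures are invented for the example.

```python
# Sample text in the format of /proc/meminfo. The numbers are made up
# for illustration; on a real machine, read the file instead.
SAMPLE_MEMINFO = """\
MemTotal:       49000000 kB
MemFree:          500000 kB
Buffers:          300000 kB
Cached:         45000000 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Split used memory into cache (buffers + page cache) and the rest."""
    total = fields.get("MemTotal", 0)
    free = fields.get("MemFree", 0)
    cache = fields.get("Buffers", 0) + fields.get("Cached", 0)
    return {
        "total_kb": total,
        "free_kb": free,
        "cache_kb": cache,
        "other_used_kb": total - free - cache,
    }


summary = summarize(parse_meminfo(SAMPLE_MEMINFO))
print(summary)
```

On a Linux host the same functions can be applied to the contents of /proc/meminfo. A large `cache_kb` with a small `other_used_kb` would suggest that the memory is file cache from heavy I/O rather than memory held by the datanode processes.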
From: Koichi S. <koi...@gm...> - 2013-11-15 07:29:11
|
Yes, I've already found this kind of failure. When disk write is much slower than read, the coordinator overshoots. However, in this case, I thought we would see palloc() overflow. If each piece of COPY TO text data is less than 1GB, then each COPY TO succeeds. If the write is much slower than the read, each coordinator will allocate memory for each COPY TO cache going to the datanodes, ending up with an enormous amount of memory consumption and a good chance of triggering the OOM killer. Your memory usage looks enormous. Maybe you have many COPY TO instances, each of which allocates too much memory.

The attached patch fixes COPY TO. Because we may have similar overshoots in COPY FROM, I'd like to write a patch to fix this too before committing. This is for the current master but will work with 1.1.x (even with 1.0.x, hopefully).

The patch looks to be fine. I asked for it to be tested in such an environment, and they reported that the patch fixed their issue.

I'm not sure, but I hope this works in your case. It may reduce memory consumption drastically. Please let me know the result.

I have to allocate some of my resources to fix the same issue in COPY FROM. I have another request to fix this, so if the patch works for you, I may be able to commit only the COPY TO fix first.

Regards;
---
Koichi Suzuki
|
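The overshoot described above (a fast reader filling memory while a slow writer falls behind) can be illustrated with a bounded buffer. The following is a generic sketch of that idea in Python, not the attached patch and not Postgres-XC code; the function name `copy_with_backpressure` and its parameters are invented for the example.

```python
import queue
import threading
import time


def copy_with_backpressure(rows, max_buffered, write_delay=0.0):
    """Pass rows from a fast reader to a slow writer through a bounded buffer.

    Returns the rows in the order they were written and the largest number
    of rows observed waiting in the buffer.
    """
    buf = queue.Queue(maxsize=max_buffered)
    written = []
    done = object()  # sentinel that tells the writer to stop
    peak = 0

    def writer():
        while True:
            item = buf.get()
            if item is done:
                break
            time.sleep(write_delay)  # simulate a slow disk write
            written.append(item)

    worker = threading.Thread(target=writer)
    worker.start()
    for row in rows:
        # put() blocks while the buffer is full, so the reader is throttled
        # to the writer's pace instead of allocating without limit.
        buf.put(row)
        peak = max(peak, buf.qsize())
    buf.put(done)
    worker.join()
    return written, peak


written, peak = copy_with_backpressure(list(range(50)), max_buffered=4,
                                       write_delay=0.001)
print(len(written), "rows written; peak buffered rows:", peak)
```

With an unbounded buffer, the number of queued rows would grow with the gap between read and write speed. With `maxsize` set, it cannot exceed `max_buffered`, at the cost of the reader sometimes waiting.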
From: Sandeep G. <gup...@gm...> - 2013-11-15 07:15:02
|
Hi Koichi,

I am using the copy command over a text file to populate a table.

Sorry for the confusion. I meant lines 628 -- 638 in pqformat.c. Let me explain this a bit more. I get an error in some of the datanode logfiles: "invalid string in message". The message appears only in pqformat.c, at line 628. However, I am not able to trace back where the error may have first occurred in the datanode that triggered it.

In order to debug this, I started printing out the rows the coordinator is sending to the datanode (in function DataNodeCopyIn). For some reason, the "invalid string in message" error has not happened so far. It seems there is some issue with the flow control of messages between the coordinator and the datanode. I hope this helps in narrowing down what is causing the problem in the first place.

However, the new problem I see is that the memory usage goes very high. Of the 48 GB, almost all of the memory is getting used. Is this normal? If not, can you suggest how to keep the memory consumption low?

-Sandeep

On Thu, Nov 14, 2013 at 8:57 PM, Koichi Suzuki <koi...@gm...> wrote:

> Could you let me know what "line 682 -- 637" means?
>
> Because any communication between coordinator and datanode is done using libpq, I don't think XC populates information using raw stringinfo. In fact, from the datanodes' point of view, the coordinator is just a client except for supplying XID and snapshot. However, COPY TO uses send()/receive(), not libpq, to send data to the datanode. In this case, if the incoming data file contains NULL in the data, such data may be sent via send() to the datanodes. COPY TO assumes incoming data in a text format. Do you think you have such situations?
>
> Analysis may take a bit but let me try.
>
> Regards;
> ---
> Koichi Suzuki
>
> 2013/11/15 Sandeep Gupta <gup...@gm...>:
> > Hi Koichi, Masataka,
> >
> > Thanks for taking a look. I double checked both memory usage and the dmesg log. The kill is not happening due to out of memory. The kernel doesn't show any processes being killed.
> >
> > My current suspicion is that something amiss is happening over the network. This happens consistently when performing copy over large datasets. A few of the datanodes end up with "invalid string in message". Going over the code in pqformat.c, this happens because the invariant "StringInfo is guaranteed to have a trailing null byte" is not valid (line 682 -- 637). I am not sure why this happens.
> >
> > After this the datanodes do something illegal (probably a buffer overflow on the port... just a guess) so that the OS decides to issue a SIGKILL. I am also not sure why the OS is not logging this killed operation.
> >
> > Please let me know of any suggestions you may have regarding the "invalid string in message". In particular, where in the code is the StringInfo msg getting populated, i.e. which code reads the message over the network and keeps it ready for the copy operation on the datanode? Any other suggestion would also be helpful to trace this bug.
> >
> > Best,
> > Sandeep
> >
> > On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito <pg...@gm...> wrote:
> > > Hi,
> > >
> > > I think I may be onto something.
> > >
> > > When the memory is exhausted, the Linux kernel kills processes randomly with SIGKILL. The mechanism is called the OOM (Out Of Memory) killer. The OOM killer logs its activity to the kernel message buffer.
> > >
> > > Could you check the kernel messages (dmesg) and memory resources?
> > >
> > > Regards.
> > >
> > > On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta <gup...@gm...> wrote:
> > > > Hi Koichi,
> > > >
> > > > It is a bit of a mystery because it does not happen consistently. Thanks for the clarification though. It is indeed helpful.
> > > >
> > > > -Sandeep
> > > >
> > > > On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 <ko...@in...> wrote:
> > > > > Someone is sending SIGKILL to coordinator/datanode backend processes. Although XC (originally PG) code has a handler for SIGKILL, I didn't find any code in XC sending SIGKILL to other processes. I'm afraid there could be another process sending SIGKILL to them.
> > > > >
> > > > > Best Regards;
> > > > > ---
> > > > > Koichi Suzuki
> > > > >
> > > > > 2013/11/09 2:54, Sandeep Gupta <gup...@gm...> wrote:
> > > > > > I am running a single instance of postgres-xc with several datanodes. Each datanode runs on its own machine. After instantiating the cluster the database sits idle. I do not perform any create table or insert operations. However, after some time all the datanodes automatically shut down (log messages attached).
> > > > > >
> > > > > > Any pointers as to why this may be happening would be very useful.
> > > > > >
> > > > > > -Sandeep
> > > > > >
> > > > > > The log files across all datanodes contain various error messages. These include:
> > > > > >
> > > > > > LOG: statistics collector process (PID 23073) was terminated by signal 9: Killed
> > > > > > LOG: WAL writer process (PID 23071) was terminated by signal 9: Killed
> > > > > > LOG: terminating any other active server processes
> > > > > > WARNING: terminating connection because of crash of another server process
> > > > > > DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > > > > > HINT: In a moment you should be able to reconnect to the database and repeat your command.
> > > > > > LOG: all server processes terminated; reinitializing
> > > > > >
> > > > > > LOG: checkpointer process (PID 21406) was terminated by signal 9: Killed
> > > > > > LOG: terminating any other active server processes
> > > > > > WARNING: terminating connection because of crash of another server process
> > > > > > DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> > > > > > LOG: statistics collector process (PID 28881) was terminated by signal 9: Killed
> > > > > > LOG: autovacuum launcher process (PID 28880) was terminated by signal 9: Killed
> > > > > > LOG: terminating any other active server processes
|
From: Koichi S. <koi...@gm...> - 2013-11-15 04:57:35
|
Could you let me know what "line 682 -- 637" means?

Because all communication between the coordinator and the datanodes is done using libpq, I don't think XC populates this information using a raw StringInfo. In fact, from a datanode's point of view, the coordinator is just a client, except that it supplies the XID and snapshot.

However, COPY TO uses send()/receive(), not libpq, to send data to the datanodes. In this case, if the incoming data file contains a NULL in the data, such data may be sent via send() to the datanodes. COPY TO assumes that the incoming data is in text format. Do you think this could be the case for you?

The analysis may take a while, but let me try.

Regards;
---
Koichi Suzuki

2013/11/15 Sandeep Gupta <gup...@gm...>:
> Hi Koichi, Masataka,
>
> Thanks for taking a look. I double checked both memory usage and the dmesg
> log. The kill is not happening due to out of memory. The kernel doesn't
> show any processes being killed.
>
> My current suspicion is that something amiss is happening over the network.
> This happens consistently when performing copy over large datasets. A few of
> the datanodes end up with "invalid string in message".
> Going over the code in pqformat.c, this happens because the invariant
> "StringInfo is guaranteed to have a trailing null byte" is not valid (line
> 682 -- 637). I am not sure why this happens.
>
> After this the datanodes do something illegal (probably a buffer overflow on
> the port, just a guess) so that the OS decides to issue a SIGKILL. I am also
> not sure why the OS is not logging this kill.
>
> Please let me know of any suggestions you may have regarding the "invalid
> string in message". In particular, where in the code the StringInfo msg is
> getting populated, i.e. which code reads the message over the network and
> keeps it ready for the copy operation on the datanode. Any other suggestion
> would also be helpful to trace this bug.
>
> Best,
> Sandeep
>
>
> On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito <pg...@gm...> wrote:
>>
>> Hi,
>>
>> I think I may be onto something.
>>
>> When the memory is exhausted, the Linux kernel kills processes randomly
>> by SIGKILL.
>> The mechanism is called the OOM (Out Of Memory) killer.
>> The OOM killer logs its activity to the kernel message buffer.
>>
>> Could you check the kernel messages (dmesg) and the memory resources?
>>
>> Regards.
>>
>> On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta <gup...@gm...> wrote:
>> > Hi Koichi,
>> >
>> > It is a bit of a mystery because it does not happen consistently.
>> > Thanks for the clarification though. It is indeed helpful.
>> >
>> > -Sandeep
>> >
>> >
>> > On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市 <ko...@in...> wrote:
>> >>
>> >> Someone is sending SIGKILL to coordinator/datanode backend processes.
>> >> Although XC (originally PG) code has a handler for SIGKILL, I didn’t
>> >> find any code in XC sending SIGKILL to other processes. I’m afraid
>> >> there could be another process sending SIGKILL to them.
>> >>
>> >> Best Regards;
>> >> ---
>> >> Koichi Suzuki
>> >>
>> >> 2013/11/09 2:54, Sandeep Gupta <gup...@gm...> wrote:
>> >>
>> >> > I am running a single instance of postgres-xc with several datanodes.
>> >> > Each datanode runs on its own machine.
>> >> > After instantiating the cluster the database sits idle. I do not
>> >> > perform any create table or insert operations. However, after some
>> >> > time all the datanodes automatically shut down (log messages attached).
>> >> >
>> >> > Any pointers as to why this may be happening would be very useful.
>> >> >
>> >> > -Sandeep
>> >> >
>> >> > The log files across all datanodes contain various error messages.
>> >> > These include:
>> >> >
>> >> > LOG: statistics collector process (PID 23073) was terminated by signal 9: Killed
>> >> > LOG: WAL writer process (PID 23071) was terminated by signal 9: Killed
>> >> > LOG: terminating any other active server processes
>> >> > WARNING: terminating connection because of crash of another server process
>> >> > DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> >> > HINT: In a moment you should be able to reconnect to the database and repeat your command.
>> >> > LOG: all server processes terminated; reinitializing
>> >> >
>> >> > LOG: checkpointer process (PID 21406) was terminated by signal 9: Killed
>> >> > LOG: terminating any other active server processes
>> >> > WARNING: terminating connection because of crash of another server process
>> >> > DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> >> > LOG: statistics collector process (PID 28881) was terminated by signal 9: Killed
>> >> > LOG: autovacuum launcher process (PID 28880) was terminated by signal 9: Killed
>> >> > LOG: terminating any other active server processes
|
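For anyone following the "invalid string in message" part of this thread: as far as I understand the check in pqformat.c, the error is raised when a string field has no terminating NUL byte inside the declared message length, so that the only NUL found is the buffer's own trailing byte. The sketch below models that check in Python rather than C, purely as an illustration. The names get_msg_string, make_buffer and ProtocolViolation are made up for this example and are not Postgres-XC identifiers.

```python
from typing import Tuple


class ProtocolViolation(Exception):
    """Raised when a string field is not terminated inside the message."""


def make_buffer(payload: bytes) -> Tuple[bytes, int]:
    """Model a receive buffer: the payload plus one trailing NUL byte.

    Returns the buffer and the payload length (the trailing NUL is not
    counted as part of the message).
    """
    return payload + b"\x00", len(payload)


def get_msg_string(data: bytes, msg_len: int, cursor: int) -> Tuple[bytes, int]:
    """Read one NUL-terminated string starting at `cursor`.

    Returns the string (without its terminator) and the new cursor.
    """
    # The trailing NUL guarantees this search always finds something.
    end = data.index(b"\x00", cursor)
    # The terminator has to be inside the payload. If the first NUL is
    # the buffer's own trailing byte, the sender never terminated the
    # string, and the message is rejected.
    if end >= msg_len:
        raise ProtocolViolation("invalid string in message")
    return data[cursor:end], end + 1


if __name__ == "__main__":
    good, n = make_buffer(b"copy_data\x00")
    print(get_msg_string(good, n, 0))

    bad, m = make_buffer(b"copy_data")  # terminator missing
    try:
        get_msg_string(bad, m, 0)
    except ProtocolViolation as exc:
        print("rejected:", exc)
```

If this model is right, a receiver would report the error whenever a sender writes raw bytes into a field that the receiver parses as a terminated string, which may be worth checking on the COPY path discussed above.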
From: Koichi S. <koi...@gm...> - 2013-11-15 03:33:50
|
The pgxc_ctl contrib module comes with this capability. To use it, you should configure and operate your XC cluster with this utility. It is described in the XC documentation; the online documentation is at http://postgres-xc.sourceforge.net/docs/1_1/

If you'd like to do this manually, the following pgxc_ctl sources will be helpful:

add_coordinatorMaster() in coord_cmd.c: describes how to add a coordinator master.
add_datanodeMaster() in datanode_cmd.c: describes how to add a datanode master.

Hope they help.

Regards;
---
Koichi Suzuki

2013/11/14 Casper, Stephan <Ste...@se...>:
> Hello,
>
> we want to use a postgres-xc cluster with two or more master nodes.
>
> If we start the cluster configuration with a GTM, 2 GTM proxies, 2
> master coordinators and 2 master datanodes, we can initialize and use it.
>
> Now we want to add or remove datanodes while the database is still running.
>
> If we add a new node, we add that one to the coordinators' node list. It
> seems the datanode now knows the database schema, but doesn't contain any
> data. The node is also in write protection mode.
>
> If we remove an existing datanode (shut down its KVM and remove it from the
> coordinators' node list), the database is no longer able to execute queries.
>
> We are using Postgres-XC 1.1 beta.
>
> So, we would like to know which is the right way to add more new master
> nodes to a running cluster, and how can we use the cluster after one of the
> nodes has crashed?
>
> Best regards,
>
> Stephan Casper
|