postgres-xc-general Mailing List for Postgres-XC

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi Koichi,

 We finally fixed the mystery of the killing postgres processes during
copy. It was nothing to do with postgres-xc. Turns out that some
misconfiguration of our compute system that resulted in killing the
process. Sorry for the trouble. We have executed several runs with the copy
command and they are  working fine so far.

-Sandeep

On Mon, Nov 18, 2013 at 10:17 PM, Koichi Suzuki <koi...@gm...>wrote:

> Thanks.   I'll be looking forward to your test result.   I talked with
> Masataka and he was interested that no relevant messages were found in
> dmsg.
>
> Regards;
> ---
> Koichi Suzuki
>
>
> 2013/11/19 Sandeep Gupta <gup...@gm...>:
> > Hi Koichi,
> >
> >  Thanks for taking a look. I did some further debugging and I won't rule
> > out the possibility of the the OS killing some postgres processes which
> > leads to copy execution reaching in a inconsistent state.
> > Will do some further tests to confirm this as well.
> >
> > -Sandeep
> >
> >
> >
> >
> > On Mon, Nov 18, 2013 at 1:06 AM, Koichi Suzuki <koi...@gm...>
> > wrote:
> >>
> >> This happens when the message is not null-terminated properly.   Most
> >> case is a bug.   In this case, different from vanilla PostgreSQL, copy
> >> data is involved in.    Please let me find a time to look into it.
> >>
> >> Regards;
> >> ---
> >> Koichi Suzuki
> >>
> >>
> >> 2013/11/16 Sandeep Gupta <gup...@gm...>:
> >> > Forgot to mention  that the error is still the same "invalid string in
> >> > message".
> >> >
> >> > -Sandeep
> >> >
> >> >
> >> >
> >> > On Fri, Nov 15, 2013 at 8:27 PM, Sandeep Gupta <
> gup...@gm...>
> >> > wrote:
> >> >>
> >> >> Hi Koichi,
> >> >>
> >> >>  We tried the  copy with the patch but it did not fix the problem for
> >> >> us.
> >> >> If you need any additional details then let me know.
> >> >>
> >> >> Best,
> >> >> -Sandeep
> >> >>
> >> >>
> >> >> On Fri, Nov 15, 2013 at 6:13 AM, Sandeep Gupta
> >> >> <gup...@gm...>
> >> >> wrote:
> >> >>>
> >> >>> Hi Koichi,
> >> >>>
> >> >>>  Many thanks for sharing the patch. We would test this in our
> >> >>> environment
> >> >>> and would report back on the result.
> >> >>>
> >> >>> Regarding the very high memory consumption, it is not the
> coordinator
> >> >>> that sees the high memory usage.
> >> >>> The setup is following: We have one coordinator and 4 datanodes. All
> >> >>> the
> >> >>> datanodes run on separate machine
> >> >>> then the coordinator.  It is the machine where the datanodes are
> >> >>> running
> >> >>> that have 48GB memory usage.
> >> >>> The other caveat is that that "top" doesn't show that it is the
> >> >>> datanodes
> >> >>> processes using the memory.
> >> >>>
> >> >>> In any case, the last run of copy even with the high usage went
> >> >>> through
> >> >>> fine.
> >> >>>
> >> >>>
> >> >>> Best,
> >> >>> Sandeep
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Fri, Nov 15, 2013 at 2:29 AM, Koichi Suzuki <
> koi...@gm...>
> >> >>> wrote:
> >> >>>>
> >> >>>> Yes, I've already found this kind of failure.   When disk write is
> >> >>>> quite slower than read, coordinator overshoots.  However, in this
> >> >>>> case, I thought we will see that palloc() overflows.   If each
> piece
> >> >>>> of COPY TO text data is less than 1GB, then each COPY TO succeeds.
> >> >>>> If the write is much slower than read,  each coordinator will
> >> >>>> allocate
> >> >>>> memory for each COPY TO cache going to datanodes, ending up with
> >> >>>> enormous amount of memory consumption and have a good chance to run
> >> >>>> OOM killer.    Your memory usage looks enormous.    Maybe you have
> >> >>>> many COPY TO instances each of which allocates too much memory.
> >> >>>>
> >> >>>> Attached patch fixes COPY TO.   Because we have a chance to have
> >> >>>> similar overshoots in COPY FROM, I'd like to write a patch to fix
> >> >>>> this
> >> >>>> too before committing.   This is for the current master but will
> work
> >> >>>> with 1.1.x (even with 1.0.x hopefully).
> >> >>>>
> >> >>>> The patch looks to be fine.   I asked to test it in such an
> >> >>>> environment and they reported the patch fixed their issue.
> >> >>>>
> >> >>>> I'm not sure but hope this works in your case.   May reduce memory
> >> >>>> consumption drastically.   Please let me know the result.
> >> >>>>
> >> >>>> I have to allocate some of my resource to fix the same issue in
> COPY
> >> >>>> FROM.    I have another request to fix this so if the patch works
> for
> >> >>>> you, I may be able to commit only COPY TO fix first.
> >> >>>>
> >> >>>> Regards;
> >> >>>> ---
> >> >>>> Koichi Suzuki
> >> >>>>
> >> >>>>
> >> >>>> 2013/11/15 Sandeep Gupta <gup...@gm...>:
> >> >>>> > Hi Koichi,
> >> >>>> >
> >> >>>> >  I am using the copy command over the text file to populate a
> >> >>>> > table.
> >> >>>> >
> >> >>>> > Sorry for the confusion. I meant 628 -- 638 lines in the
> >> >>>> > pqformat.c.
> >> >>>> > Let me explain this a bit more. I get  error in some of the
> >> >>>> > datanode
> >> >>>> > logfiles
> >> >>>> > as "invalid string in message". The message is mentioned only in
> >> >>>> > pqformat.c
> >> >>>> > at line 628. Although I am not able to trace back as to where
> >> >>>> > error
> >> >>>> > may
> >> >>>> > have first ordered in the datanode that triggered this error.
> >> >>>> >
> >> >>>> > In order to debug this I started printing out the rows the
> >> >>>> > coordinator
> >> >>>> > is
> >> >>>> > sending to the datanode (in function DataNodeCopyIn). For some
> >> >>>> > reason
> >> >>>> > the
> >> >>>> > error invalid string in messge error has  not happened so far.
>  It
> >> >>>> > seems
> >> >>>> > there is some issue with the flow control of message between
> >> >>>> > coordinator and
> >> >>>> > datanode.
> >> >>>> > Hope this helps in narrowing down what is causing the problem in
> >> >>>> > the
> >> >>>> > first
> >> >>>> > place.
> >> >>>> >
> >> >>>> >
> >> >>>> > However, the new problem I see is that the memory usage goes very
> >> >>>> > high. Of
> >> >>>> > the 48 GB almost all of the memory is getting used. Is this
> normal?
> >> >>>> > If not can you give suggestion as to how to keep the memory
> >> >>>> > consumption
> >> >>>> > low.
> >> >>>> >
> >> >>>> > -Sandeep
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> > On Thu, Nov 14, 2013 at 8:57 PM, Koichi Suzuki
> >> >>>> > <koi...@gm...>
> >> >>>> > wrote:
> >> >>>> >>
> >> >>>> >> Could you let me know what "line 682 -- 637" means?
> >> >>>> >>
> >> >>>> >> Because any communication between coordinator and datanode is
> done
> >> >>>> >> using libpq, I don't think XC populates information using raw
> >> >>>> >> stringinfo.   In fact, from datanodes, coordinator is just a
> >> >>>> >> client
> >> >>>> >> except for supplying XID and snapshot.   However,  COPY TO uses
> >> >>>> >> send()/receive(), not through libpq, to send data to datanode.
>  In
> >> >>>> >> this case, if incoming data file contains NULL in the data, such
> >> >>>> >> data
> >> >>>> >> may be sent via send() to datanodes.   COPY TO assumes incoming
> >> >>>> >> data
> >> >>>> >> in a text format.   Do you think you have such situations?
> >> >>>> >>
> >> >>>> >> Analysis may take a bit but let me try.
> >> >>>> >>
> >> >>>> >> Regards;
> >> >>>> >> ---
> >> >>>> >> Koichi Suzuki
> >> >>>> >>
> >> >>>> >>
> >> >>>> >> 2013/11/15 Sandeep Gupta <gup...@gm...>:
> >> >>>> >> > Hi Koichi, Masataka,
> >> >>>> >> >
> >> >>>> >> >  Thanks for taking a look. I double checked both memory usage
> >> >>>> >> > and
> >> >>>> >> > the
> >> >>>> >> > dmesg
> >> >>>> >> > log. The kill is not happening due to the out of memory. The
> >> >>>> >> > kernel
> >> >>>> >> > doesn't
> >> >>>> >> > show any processes being killed.
> >> >>>> >> >
> >> >>>> >> > My current suspicion is that something amiss is happening over
> >> >>>> >> > the
> >> >>>> >> > network.
> >> >>>> >> > This happens consistently when performing copy over large
> >> >>>> >> > datasets.
> >> >>>> >> > Few
> >> >>>> >> > of
> >> >>>> >> > datanodes end up with "invalid string in message".
> >> >>>> >> > Going over the code in pqformat.c this happens because the
> >> >>>> >> > invariant
> >> >>>> >> > "StringInfo is guaranteed to have a trailing null byte" is not
> >> >>>> >> > valid
> >> >>>> >> > (line
> >> >>>> >> > 682 -- 637). I am not sure why this happens.
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> > After this the datanodes do something illegal (probably buffer
> >> >>>> >> > overflow
> >> >>>> >> > on
> >> >>>> >> > the port..just a guess) that the OS decides to issue a
> sigkill.
> >> >>>> >> > I
> >> >>>> >> > am
> >> >>>> >> > also
> >> >>>> >> > not sure why the OS is not logging this killed operation.
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> > Please let me know of  suggestions you may have regarding the
> >> >>>> >> > "invalid
> >> >>>> >> > string in message". In particular, where in the code is
> >> >>>> >> > StringInfo
> >> >>>> >> > msg
> >> >>>> >> > is
> >> >>>> >> > getting populated i.e. which code reads the message over the
> >> >>>> >> > network and
> >> >>>> >> > keeps it ready for the copy operation on the datanode. Any
> other
> >> >>>> >> > suggestion
> >> >>>> >> > would also be helpful to trace this bug.
> >> >>>> >> >
> >> >>>> >> > Best,
> >> >>>> >> > Sandeep
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> > On Thu, Nov 14, 2013 at 10:25 AM, Masataka Saito
> >> >>>> >> > <pg...@gm...>
> >> >>>> >> > wrote:
> >> >>>> >> >>
> >> >>>> >> >> Hi,
> >> >>>> >> >>
> >> >>>> >> >> I think I may be onto something.
> >> >>>> >> >>
> >> >>>> >> >> When the memory is exhausted, the Linux kernel kills process
> >> >>>> >> >> randomly
> >> >>>> >> >> by SIGKILL.
> >> >>>> >> >> The mechanism is called OOM(Out Of Memory) killer.
> >> >>>> >> >> OOM killer logs its activity to kernel mesage buffer.
> >> >>>> >> >>
> >> >>>> >> >> Could you check display message (dmesg) and memory resource?
> >> >>>> >> >>
> >> >>>> >> >> Regards.
> >> >>>> >> >>
> >> >>>> >> >> On Tue, Nov 12, 2013 at 3:55 AM, Sandeep Gupta
> >> >>>> >> >> <gup...@gm...>
> >> >>>> >> >> wrote:
> >> >>>> >> >> > Hi Koichi,
> >> >>>> >> >> >
> >> >>>> >> >> >    It is a bit of mystery  because it does not happen
> >> >>>> >> >> > consistently.
> >> >>>> >> >> > Thanks
> >> >>>> >> >> > for the clarification though. It is indeed helpful.
> >> >>>> >> >> >
> >> >>>> >> >> > -Sandeep
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> > On Mon, Nov 11, 2013 at 1:58 AM, 鈴木 幸市
> >> >>>> >> >> > <ko...@in...>
> >> >>>> >> >> > wrote:
> >> >>>> >> >> >>
> >> >>>> >> >> >> Someone is sending SIGKILL to coordinator/datanode backend
> >> >>>> >> >> >> processes.
> >> >>>> >> >> >> Although XC (originally PG) code has a handler for
> SIGKILL,
> >> >>>> >> >> >> I
> >> >>>> >> >> >> didn’t
> >> >>>> >> >> >> found
> >> >>>> >> >> >> any code in XC sending SIGKILL to other processes.   I’m
> >> >>>> >> >> >> afraid
> >> >>>> >> >> >> there
> >> >>>> >> >> >> could
> >> >>>> >> >> >> be another process sending SIGKILL to them.
> >> >>>> >> >> >>
> >> >>>> >> >> >> Best Regards;
> >> >>>> >> >> >> ---
> >> >>>> >> >> >> Koichi Suzuki
> >> >>>> >> >> >>
> >> >>>> >> >> >> 2013/11/09 2:54、Sandeep Gupta <gup...@gm...>
> >> >>>> >> >> >> のメール：
> >> >>>> >> >> >>
> >> >>>> >> >> >> > I am running single instance of postgres-xc with
>  several
> >> >>>> >> >> >> > datanode
> >> >>>> >> >> >> > Each data node runs on its own machine.
> >> >>>> >> >> >> > After instantiating the cluster the database sits idle.
> I
> >> >>>> >> >> >> > do
> >> >>>> >> >> >> > not
> >> >>>> >> >> >> > perform
> >> >>>> >> >> >> > any create table or insert operations. However, after
> some
> >> >>>> >> >> >> > time
> >> >>>> >> >> >> > all
> >> >>>> >> >> >> > the
> >> >>>> >> >> >> > datanode  automatically shuts down (log messages
> >> >>>> >> >> >> > attached).
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > Any pointers as to why this maybe happening would be
> very
> >> >>>> >> >> >> > useful.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > -Sandeep
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > The log files across all datanode contain various error
> >> >>>> >> >> >> > messages.
> >> >>>> >> >> >> > These
> >> >>>> >> >> >> > include:
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > LOG:  statistics collector process (PID 23073) was
> >> >>>> >> >> >> > terminated
> >> >>>> >> >> >> > by
> >> >>>> >> >> >> > signal
> >> >>>> >> >> >> > 9: Killed
> >> >>>> >> >> >> > LOG:  WAL writer process (PID 23071) was terminated by
> >> >>>> >> >> >> > signal
> >> >>>> >> >> >> > 9:
> >> >>>> >> >> >> > Killed
> >> >>>> >> >> >> > LOG:  terminating any other active server processes
> >> >>>> >> >> >> > WARNING:  terminating connection because of crash of
> >> >>>> >> >> >> > another
> >> >>>> >> >> >> > server
> >> >>>> >> >> >> > process
> >> >>>> >> >> >> > DETAIL:  The postmaster has commanded this server
> process
> >> >>>> >> >> >> > to
> >> >>>> >> >> >> > roll
> >> >>>> >> >> >> > back
> >> >>>> >> >> >> > the current transaction and exit, because another server
> >> >>>> >> >> >> > process
> >> >>>> >> >> >> > exited
> >> >>>> >> >> >> > abnormally and possibly corrupted shared memory.
> >> >>>> >> >> >> > HINT:  In a moment you should be able to reconnect to
> the
> >> >>>> >> >> >> > database
> >> >>>> >> >> >> > and
> >> >>>> >> >> >> > repeat your command.
> >> >>>> >> >> >> > LOG:  all server processes terminated; reinitializing
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > LOG:  checkpointer process (PID 21406) was terminated by
> >> >>>> >> >> >> > signal 9:
> >> >>>> >> >> >> > Killed
> >> >>>> >> >> >> > LOG:  terminating any other active server processes
> >> >>>> >> >> >> > WARNING:  terminating connection because of crash of
> >> >>>> >> >> >> > another
> >> >>>> >> >> >> > server
> >> >>>> >> >> >> > process
> >> >>>> >> >> >> > DETAIL:  The postmaster has commanded this server
> process
> >> >>>> >> >> >> > to
> >> >>>> >> >> >> > roll
> >> >>>> >> >> >> > back
> >> >>>> >> >> >> > the current transaction and exit, because another server
> >> >>>> >> >> >> > process
> >> >>>> >> >> >> > exited
> >> >>>> >> >> >> > abnormally and possibly corrupted shared memory.
> >> >>>> >> >> >> > LOG:  statistics collector process (PID 28881) was
> >> >>>> >> >> >> > terminated
> >> >>>> >> >> >> > by
> >> >>>> >> >> >> > signal
> >> >>>> >> >> >> > 9: Killed
> >> >>>> >> >> >> > LOG:  autovacuum launcher process (PID 28880) was
> >> >>>> >> >> >> > terminated
> >> >>>> >> >> >> > by
> >> >>>> >> >> >> > signal
> >> >>>> >> >> >> > 9: Killed
> >> >>>> >> >> >> > LOG:  terminating any other active server processes
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> ------------------------------------------------------------------------------
> >> >>>> >> >> >> > November Webinars for C, C++, Fortran Developers
> >> >>>> >> >> >> > Accelerate application performance with scalable
> >> >>>> >> >> >> > programming
> >> >>>> >> >> >> > models.
> >> >>>> >> >> >> > Explore
> >> >>>> >> >> >> > techniques for threading, error checking, porting, and
> >> >>>> >> >> >> > tuning. Get
> >> >>>> >> >> >> > the
> >> >>>> >> >> >> > most
> >> >>>> >> >> >> > from the latest Intel processors and coprocessors. See
> >> >>>> >> >> >> > abstracts
> >> >>>> >> >> >> > and
> >> >>>> >> >> >> > register
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk_______________________________________________
> >> >>>> >> >> >> > Postgres-xc-general mailing list
> >> >>>> >> >> >> > Pos...@li...
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> >> >>>> >> >> >> >
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
> >> >>>> >> >> >>
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> ------------------------------------------------------------------------------
> >> >>>> >> >> > November Webinars for C, C++, Fortran Developers
> >> >>>> >> >> > Accelerate application performance with scalable
> programming
> >> >>>> >> >> > models.
> >> >>>> >> >> > Explore
> >> >>>> >> >> > techniques for threading, error checking, porting, and
> >> >>>> >> >> > tuning.
> >> >>>> >> >> > Get
> >> >>>> >> >> > the
> >> >>>> >> >> > most
> >> >>>> >> >> > from the latest Intel processors and coprocessors. See
> >> >>>> >> >> > abstracts
> >> >>>> >> >> > and
> >> >>>> >> >> > register
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> >
> http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk
> >> >>>> >> >> > _______________________________________________
> >> >>>> >> >> > Postgres-xc-general mailing list
> >> >>>> >> >> > Pos...@li...
> >> >>>> >> >> >
> >> >>>> >> >> >
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
> >> >>>> >> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> ------------------------------------------------------------------------------
> >> >>>> >> > DreamFactory - Open Source REST & JSON Services for HTML5 &
> >> >>>> >> > Native
> >> >>>> >> > Apps
> >> >>>> >> > OAuth, Users, Roles, SQL, NoSQL, BLOB Storage and External API
> >> >>>> >> > Access
> >> >>>> >> > Free app hosting. Or install the open source package on any
> LAMP
> >> >>>> >> > server.
> >> >>>> >> > Sign up and see examples for AngularJS, jQuery, Sencha Touch
> and
> >> >>>> >> > Native!
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> http://pubads.g.doubleclick.net/gampad/clk?id=63469471&iu=/4140/ostg.clktrk
> >> >>>> >> > _______________________________________________
> >> >>>> >> > Postgres-xc-general mailing list
> >> >>>> >> > Pos...@li...
> >> >>>> >> >
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
> >> >>>> >> >
> >> >>>> >
> >> >>>> >
> >> >>>
> >> >>>
> >> >>
> >> >
> >
> >
>

2010	Jan	Feb	Mar	Apr	May (2)	Jun	Jul	Aug (6)	Sep	Oct (19)	Nov (1)	Dec
2011	Jan (12)	Feb (1)	Mar (4)	Apr (4)	May (32)	Jun (12)	Jul (11)	Aug (1)	Sep (6)	Oct (3)	Nov	Dec (10)
2012	Jan (11)	Feb (1)	Mar (3)	Apr (25)	May (53)	Jun (38)	Jul (103)	Aug (54)	Sep (31)	Oct (66)	Nov (77)	Dec (20)
2013	Jan (91)	Feb (86)	Mar (103)	Apr (107)	May (25)	Jun (37)	Jul (17)	Aug (59)	Sep (38)	Oct (78)	Nov (29)	Dec (15)
2014	Jan (23)	Feb (82)	Mar (118)	Apr (101)	May (103)	Jun (45)	Jul (6)	Aug (10)	Sep	Oct (32)	Nov	Dec (9)
2015	Jan (3)	Feb (5)	Mar	Apr (1)	May	Jun	Jul (9)	Aug (4)	Sep (3)	Oct	Nov	Dec
2016	Jan (3)	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
2017	Jan	Feb	Mar	Apr	May	Jun (3)	Jul	Aug	Sep	Oct	Nov	Dec
2018	Jan	Feb	Mar	Apr	May (4)	Jun	Jul	Aug	Sep	Oct	Nov	Dec

S	M	T	W	T	F	S
1	2	3	4	5 (1)	6	7
8	9 (2)	10 (3)	11 (6)	12 (2)	13 (1)	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

postgres-xc-general Mailing List for Postgres-XC

postgres-xc-general — General info and messages