From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Support for N synchronous standby servers - take 2 |
Date: | 2015-05-15 11:55:04 |
Message-ID: | CAOG9ApHYCPmTypAAwfD3_V7sVOkbnECFivmRc1AxhB40ZBSwNQ@mail.gmail.com |
Lists: | pgsql-hackers |
There was a discussion on support for N synchronous standby servers started
by Michael. Refer
http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
. The use of hooks and a dedicated language was suggested; however, it seemed
to be overkill for the scenario and there was no consensus on it.
Exploring GUC-land was preferred.
Please find attached a patch, built on Michael's patch from the
above-mentioned thread, which supports choosing a different number of nodes
from each set, i.e. k nodes from set 1, l nodes from set 2, and so on.
The format of synchronous_standby_names has been updated to the standby name
followed by the required count, separated by a hyphen, e.g. 'aa-1, bb-3'. A
transaction waits for the specified number of standbys in each group.
Any extra nodes with the same name are considered potential standbys. The
special entry * for the standby name is also supported.
Thanks,
Beena Emerson
Attachment | Content-Type | Size |
---|---|---|
20150515_multiple_sync_rep.patch | application/octet-stream | 21.7 KB |
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-05-15 12:18:03 |
Message-ID: | CAB7nPqRJbeEK5PFs0aJF049CPkwcmA4E28XF=Ecu-rhOuJSmXw@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> There was a discussion on support for N synchronous standby servers started
> by Michael. Refer
> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
> . The use of hooks and dedicated language was suggested, however, it seemed
> to be an overkill for the scenario and there was no consensus on this.
> Exploring GUC-land was preferred.
Cool.
> Please find attached a patch, built on Michael's patch from above mentioned
> thread, which supports choosing different number of nodes from each set i.e.
> k nodes from set 1, l nodes from set 2, so on.
> The format of synchronous_standby_names has been updated to standby name
> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'. The
> transaction waits for all the specified number of standby in each group. Any
> extra nodes with the same name will be considered potential. The special
> entry * for the standby name is also supported.
I don't think that this is going in the right direction; what was
suggested, mainly by Robert, was to use a micro-language that would
allow far more extensibility than what you are proposing. See for
example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w(at)mail(dot)gmail(dot)com
for some ideas. IMO, before writing any patch in this area we should
find a clear consensus on what we want to do. Also, unrelated to this
patch, we should really first get the patch implementing the... Hum...
infrastructure for regression tests covering replication and
archiving, so that we can have actual tests for this feature (working on
it for the next CF).
+ if (!SplitIdentifierString(standby_detail, '-', &elemlist2))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist1);
+ return 0;
+ }
At quick glance, this looks problematic to me if application_name has a hyphen.
Regards,
--
Michael
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-05-16 08:58:35 |
Message-ID: | CAD21AoANuXsjSMdpT60eopMfwYNO=YCPDT=0rr7q6w-TKo0eNQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 15, 2015 at 9:18 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>> There was a discussion on support for N synchronous standby servers started
>> by Michael. Refer
>> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
>> . The use of hooks and dedicated language was suggested, however, it seemed
>> to be an overkill for the scenario and there was no consensus on this.
>> Exploring GUC-land was preferred.
>
> Cool.
>
>> Please find attached a patch, built on Michael's patch from above mentioned
>> thread, which supports choosing different number of nodes from each set i.e.
>> k nodes from set 1, l nodes from set 2, so on.
>> The format of synchronous_standby_names has been updated to standby name
>> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'. The
>> transaction waits for all the specified number of standby in each group. Any
>> extra nodes with the same name will be considered potential. The special
>> entry * for the standby name is also supported.
>
> I don't think that this is going in the good direction, what was
> suggested mainly by Robert was to use a micro-language that would
> allow far more extensibility that what you are proposing. See for
> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w(at)mail(dot)gmail(dot)com
> for some ideas. IMO, before writing any patch in this area we should
> find a clear consensus on what we want to do. Also, unrelated to this
> patch, we should really get first the patch implementing the... Hum...
> infrastructure for regression tests regarding replication and
> archiving to be able to have actual tests for this feature (working on
> it for next CF).
The dedicated language for multiple sync replication would be more
extensible, as you said, but I think there are not a lot of users who
want to, or should, use this.
IMHO such a dedicated, extensible feature could be an extension module,
e.g. in contrib, and we could implement a simpler feature in
PostgreSQL core with some restrictions.
Regards,
-------
Sawada Masahiko
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-05-16 13:28:29 |
Message-ID: | CAB7nPqRzZZUZ_yVkSyqwE7nDw1h+Lfo7g9UMjED8RGESTn0opg@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, May 16, 2015 at 5:58 PM, Sawada Masahiko wrote:
> The dedicated language for multiple sync replication would be more
> extensibility as you said, but I think there are not a lot of user who
> want to or should use this.
> IMHO such a dedicated extensible feature could be extension module,
> i.g. contrib. And we could implement more simpler feature into
> PostgreSQL core with some restriction.
As proposed, this feature does not really bring us closer to quorum
commit, and AFAIK that is what we are more or less aiming at, recalling
previous discussions. In particular, with the syntax proposed above it
is not possible to express OR conditions on subgroups of nodes; the
list of nodes is forced to use AND because it is necessary to wait
for all the subgroups. Also, users may want to track nodes from the
same group with different application_name values.
--
Michael
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-05-18 11:42:53 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hello,
> I don't think that this is going in the good direction, what was
> suggested mainly by Robert was to use a micro-language that would
> allow far more extensibility that what you are proposing.
I agree, the micro-language would give far more extensibility. However, as
stated before, the previous discussions concluded that a GUC was the preferred
way because it is more user-friendly.
> See for
> example [hidden email]
> for some ideas. IMO, before writing any patch in this area we should
> find a clear consensus on what we want to do. Also, unrelated to this
> patch, we should really get first the patch implementing the... Hum...
> infrastructure for regression tests regarding replication and
> archiving to be able to have actual tests for this feature (working on
> it for next CF).
We could decide on and work on the patch for N-sync along with setting up
the regression test infrastructure.
> At quick glance, this looks problematic to me if application_name has an
> hyphen.
Yes, I overlooked the fact that the application name could contain a hyphen.
This can be modified.
Regards,
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5849711.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-05-18 11:43:19 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
> As proposed, this feature does not bring us really closer to quorum
> commit, and AFAIK this is what we are more or less aiming at recalling
> previous discussions. Particularly with the syntax proposed above, it
> is not possible to do some OR conditions on subgroups of nodes, the
> list of nodes is forcibly using AND because it is necessary to wait
> for all the subgroups. Also, users may want to track nodes from the
> same group with different application_name.
The patch assumes that all standbys of a group share a name, so the "OR"
condition would be taken care of that way.
Also, since uniqueness of standby_name cannot be enforced, the same name
could be repeated across groups!
Regards,
Beena
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-05-18 11:46:29 |
Message-ID: | CAB7nPqSocdBZCvMK5odTPfHW-Gmuh0J+rCzGtv1tfw1hW+56bA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, May 18, 2015 at 8:42 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Hello,
>
>> I don't think that this is going in the good direction, what was
>> suggested mainly by Robert was to use a micro-language that would
>> allow far more extensibility that what you are proposing.
>
> I agree, the micro-language would give far more extensibility. However, as
> stated before, the previous discussions concluded that GUC was a preferred
> way because it is more user-friendly.
Er, I am not sure I follow here. The idea proposed was to define a
string formatted with some micro-language within the existing GUC
s_s_names.
--
Michael
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-05-18 13:40:20 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hello,
> Er, I am not sure I follow here. The idea proposed was to define a
> string formatted with some infra-language within the existing GUC
> s_s_names.
I am sorry, I misunderstood. I thought the "language" approach meant the use
of hooks and a module.
As you mentioned, the first step would be to reach a consensus on the
method.
If I understand correctly, s_s_names should be able to define:
- a count of sync standbys from a given group of names, e.g. 2 from A,B,C.
- an AND condition: multiple groups and counts can be defined, e.g. 1 from X,Y
AND 2 from A,B,C.
In this case, we can give the same priority to all the names specified in a
group. The standby names cannot be repeated across groups.
Robert had also talked about slightly more complex scenarios, such as choosing
either A or both B and C.
Additionally, a preference among standbys could also be specified, e.g. among
A, B and C, A can have a higher priority and would be selected whenever a
standby with name A is connected.
This can make the language very complicated.
Should all these scenarios be covered in the N-sync selection, or can we
start with the basic two and extend later?
Thanks & Regards,
Beena Emerson
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-05-21 12:43:17 |
Message-ID: | CA+TgmoZP_pwB+tFCLjYJCrN-LOLoCStSabjBsPaQnR81QvHFzw@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, May 18, 2015 at 9:40 AM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>> Er, I am not sure I follow here. The idea proposed was to define a
>> string formatted with some infra-language within the existing GUC
>> s_s_names.
>
> I am sorry, I misunderstood. I thought the "language" approach meant use of
> hooks and module.
> As you mentioned the first step would be to reach the consensus on the
> method.
>
> If I understand correctly, s_s_names should be able to define:
> - a count of sync rep from a given group of names ex : 2 from A,B,C.
> - AND condition: Multiple groups and count can be defined. Ex: 1 from X,Y
> AND 2 from A,B,C.
>
> In this case, we can give the same priority to all the names specified in a
> group. The standby_names cannot be repeated across groups.
>
> Robert had also talked about a little more complex scenarios of choosing
> either A or both B and C.
> Additionally, preference for a standby could also be specified. Ex: among
> A,B and C, A can have higher priority and would be selected if an standby
> with name A is connected.
> This can make the language very complicated.
>
> Should all these scenarios be covered in the n-sync selection or can we
> start with the basic 2 and then update later?
If it were me, I'd just go implement a scanner using flex and a parser
using bison and use that to parse the format I suggested before, or
some similar one. This may sound hard, but it's really not: I put
together the patch that became commit
878fdcb843e087cc1cdeadc987d6ef55202ddd04 in just a few hours. I don't
see why this would be particularly harder. Then instead of arguing
about whether some stop-gap implementation is good enough until we do
the real thing, we can just have the real thing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-24 14:30:08 |
Message-ID: | CAHGQGwEyPv=EN7g=QmuVWNubxEuKGbVqca_VojepmT7aX+0_5g@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 15, 2015 at 9:18 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>> There was a discussion on support for N synchronous standby servers started
>> by Michael. Refer
>> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
>> . The use of hooks and dedicated language was suggested, however, it seemed
>> to be an overkill for the scenario and there was no consensus on this.
>> Exploring GUC-land was preferred.
>
> Cool.
>
>> Please find attached a patch, built on Michael's patch from above mentioned
>> thread, which supports choosing different number of nodes from each set i.e.
>> k nodes from set 1, l nodes from set 2, so on.
>> The format of synchronous_standby_names has been updated to standby name
>> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'. The
>> transaction waits for all the specified number of standby in each group. Any
>> extra nodes with the same name will be considered potential. The special
>> entry * for the standby name is also supported.
>
> I don't think that this is going in the good direction, what was
> suggested mainly by Robert was to use a micro-language that would
> allow far more extensibility that what you are proposing. See for
> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w(at)mail(dot)gmail(dot)com
> for some ideas.
Doesn't this approach prevent us from specifying a "potential" synchronous
standby server? For example, imagine the case where you want to treat
server AAA as the synchronous standby, and to use server BBB as the
synchronous standby only if AAA goes down. IOW, you prefer server AAA
as the synchronous standby over BBB.
Currently we can easily set up that case just by setting
synchronous_standby_names as follows:
synchronous_standby_names = 'AAA, BBB'
However, after we adopt the quorum commit feature with the proposed
micro-language, how can we set up that case? It seems impossible...
I'm afraid that this might be a backward compatibility issue.
Or should we extend the proposed micro-language so that it can also handle
the priority of each standby server? Not sure that's possible, though.
Regards,
--
Fujii Masao
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-25 03:15:06 |
Message-ID: | CAB7nPqSJRTn7nHYCQpa9weqxqnhuNWhdXmsN_hUDG9WtpuMgmA@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jun 24, 2015 at 11:30 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, May 15, 2015 at 9:18 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>>> There was a discussion on support for N synchronous standby servers started
>>> by Michael. Refer
>>> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
>>> . The use of hooks and dedicated language was suggested, however, it seemed
>>> to be an overkill for the scenario and there was no consensus on this.
>>> Exploring GUC-land was preferred.
>>
>> Cool.
>>
>>> Please find attached a patch, built on Michael's patch from above mentioned
>>> thread, which supports choosing different number of nodes from each set i.e.
>>> k nodes from set 1, l nodes from set 2, so on.
>>> The format of synchronous_standby_names has been updated to standby name
>>> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'. The
>>> transaction waits for all the specified number of standby in each group. Any
>>> extra nodes with the same name will be considered potential. The special
>>> entry * for the standby name is also supported.
>>
>> I don't think that this is going in the good direction, what was
>> suggested mainly by Robert was to use a micro-language that would
>> allow far more extensibility that what you are proposing. See for
>> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w(at)mail(dot)gmail(dot)com
>> for some ideas.
>
> Doesn't this approach prevent us from specifying the "potential" synchronous
> standby server? For example, imagine the case where you want to treat
> the server AAA as synchronous standby. You also want to use the server BBB
> as synchronous standby only if the server AAA goes down. IOW, you want to
> prefer to the server AAA as synchronous standby rather than BBB.
> Currently we can easily set up that case by just setting
> synchronous_standby_names as follows.
>
> synchronous_standby_names = 'AAA, BBB'
>
> However, after we adopt the quorum commit feature with the proposed
> macro-language, how can we set up that case? It seems impossible...
> I'm afraid that this might be a backward compatibility issue.
Like that:
synchronous_standby_names = 'AAA, BBB'
The thing is that we need to support the old grammar as well to be
fully backward compatible, and that's actually equivalent to this in
the new grammar: 1(AAA,BBB,CCC). This is something I understood was
included in Robert's draft proposal.
> Or we should extend the proposed micro-language so that it also can handle
> the priority of each standby servers? Not sure that's possible, though.
I am not sure that's really necessary; we only need to be able to
manage priorities within each subgroup. Putting it in a shape that
users can easily understand in pg_stat_replication looks more
challenging, though. We are going to need a new view, something like
pg_stat_replication_group, that shows the priority status of each
group, with one record per group, taking into account that a group
can be included in another one.
--
Michael
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-25 03:57:22 |
Message-ID: | CAHGQGwHQh7wD_VXT0Rd67-6Y01n79PT7Fa0uxX3_brSXXYXGMw@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Wed, Jun 24, 2015 at 11:30 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Fri, May 15, 2015 at 9:18 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>>>> There was a discussion on support for N synchronous standby servers started
>>>> by Michael. Refer
>>>> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
>>>> . The use of hooks and dedicated language was suggested, however, it seemed
>>>> to be an overkill for the scenario and there was no consensus on this.
>>>> Exploring GUC-land was preferred.
>>>
>>> Cool.
>>>
>>>> Please find attached a patch, built on Michael's patch from above mentioned
>>>> thread, which supports choosing different number of nodes from each set i.e.
>>>> k nodes from set 1, l nodes from set 2, so on.
>>>> The format of synchronous_standby_names has been updated to standby name
>>>> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'. The
>>>> transaction waits for all the specified number of standby in each group. Any
>>>> extra nodes with the same name will be considered potential. The special
>>>> entry * for the standby name is also supported.
>>>
>>> I don't think that this is going in the good direction, what was
>>> suggested mainly by Robert was to use a micro-language that would
>>> allow far more extensibility that what you are proposing. See for
>>> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w(at)mail(dot)gmail(dot)com
>>> for some ideas.
>>
>> Doesn't this approach prevent us from specifying the "potential" synchronous
>> standby server? For example, imagine the case where you want to treat
>> the server AAA as synchronous standby. You also want to use the server BBB
>> as synchronous standby only if the server AAA goes down. IOW, you want to
>> prefer to the server AAA as synchronous standby rather than BBB.
>> Currently we can easily set up that case by just setting
>> synchronous_standby_names as follows.
>>
>> synchronous_standby_names = 'AAA, BBB'
>>
>> However, after we adopt the quorum commit feature with the proposed
>> macro-language, how can we set up that case? It seems impossible...
>> I'm afraid that this might be a backward compatibility issue.
>
> Like that:
> synchronous_standby_names = 'AAA, BBB'
> The thing is that we need to support the old grammar as well to be
> fully backward compatible,
Yep, that's an idea. Supporting two different grammars is a bit messy, though...
It would be better if we could merge the "priority" concept into the
quorum commit, but for now I have no idea how we can do that.
> and that's actually equivalent to that in
> the grammar: 1(AAA,BBB,CCC).
I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
the two servers AAA and BBB are running, the master server may return success
for the transaction to the client just after it receives the ACK from BBB.
OTOH, in the case of AAA,BBB, that never happens: the master must wait for
the ACK from AAA to arrive before completing the transaction. And then,
if AAA goes down, BBB becomes the synchronous standby.
Regards,
--
Fujii Masao
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-25 04:01:04 |
Message-ID: | CAB7nPqSbeQnb=1pRQC_YkzhhsH3yZXH7XjVn27ddH-pfc4QMZA@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
> On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>> and that's actually equivalent to that in
>> the grammar: 1(AAA,BBB,CCC).
>
> I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
> two servers AAA and BBB are running, the master server may return a success
> of the transaction to the client just after it receives the ACK from BBB.
> OTOH, in the case of AAA,BBB, that never happens. The master must wait for
> the ACK from AAA to arrive before completing the transaction. And then,
> if AAA goes down, BBB should become synchronous standby.
Ah. Right. I missed your point; that's a bad day... We could then have
multiple separators to define group types:
- "()" where the order of acknowledgement does not matter
- "[]" where it does.
You would get the old grammar back with:
1[AAA,BBB,CCC]
--
Michael
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-25 11:32:28 |
Message-ID: | CANP8+j+7aPHjtKP+tB9dtYWSiYvni8DAibNBV6n4kCqv6nYWBQ@mail.gmail.com |
Lists: | pgsql-hackers |
On 25 June 2015 at 05:01, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
> > On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
> >> and that's actually equivalent to that in
> >> the grammar: 1(AAA,BBB,CCC).
> >
> > I don't think that they are the same. In the case of 1(AAA,BBB,CCC),
> while
> > two servers AAA and BBB are running, the master server may return a
> success
> > of the transaction to the client just after it receives the ACK from BBB.
> > OTOH, in the case of AAA,BBB, that never happens. The master must wait
> for
> > the ACK from AAA to arrive before completing the transaction. And then,
> > if AAA goes down, BBB should become synchronous standby.
>
> Ah. Right. I missed your point, that's a bad day... We could have
> multiple separators to define group types then:
> - "()" where the order of acknowledgement does not matter
> - "[]" where it does not.
> You would find the old grammar with:
> 1[AAA,BBB,CCC]
>
Let's start with a complex, fully described use case, then work out how to
specify what we want.
I'm nervous of "it would be good ifs", because we could do a ton of work only
to find a design flaw.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-25 15:49:06 |
Message-ID: | CAD21AoCQK=1_-mTqV24juq5FfzZV-09w6sandpDVKdKAxCZUag@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Jun 25, 2015 at 7:32 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 25 June 2015 at 05:01, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>>
>> On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
>> > On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>> >> and that's actually equivalent to that in
>> >> the grammar: 1(AAA,BBB,CCC).
>> >
>> > I don't think that they are the same. In the case of 1(AAA,BBB,CCC),
>> > while
>> > two servers AAA and BBB are running, the master server may return a
>> > success
>> > of the transaction to the client just after it receives the ACK from
>> > BBB.
>> > OTOH, in the case of AAA,BBB, that never happens. The master must wait
>> > for
>> > the ACK from AAA to arrive before completing the transaction. And then,
>> > if AAA goes down, BBB should become synchronous standby.
>>
>> Ah. Right. I missed your point, that's a bad day... We could have
>> multiple separators to define group types then:
>> - "()" where the order of acknowledgement does not matter
>> - "[]" where it does not.
>> You would find the old grammar with:
>> 1[AAA,BBB,CCC]
>
> Let's start with a complex, fully described use case then work out how to
> specify what we want.
>
> I'm nervous of "it would be good ifs" because we do a ton of work only to
> find a design flaw.
>
I'm not sure about the specific implementation yet, but I came up with a
solution for this case.
For example,
- s_s_name = '1(a, b), c, d'
The priority of both 'a' and 'b' is 1, 'c' is 2, and 'd' is 3.
I.e., 'b' and 'c' are potential sync nodes, and quorum commit is
enabled only between 'a' and 'b'.
- s_s_name = 'a, 1(b,c), d'
The priority of 'a' is 1, 'b' and 'c' are 2, and 'd' is 3.
So quorum commit with 'b' and 'c' will be enabled after 'a' goes down.
With this idea, I think that we could keep using the conventional syntax.
Thoughts?
Regards,
--
Sawada Masahiko
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 05:46:15 |
Message-ID: | CAB7nPqQ5dOBFqUL8OfKzjJA7JGf_gqZsO+c8YwWYeZPcXgeH6A@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote:
> Let's start with a complex, fully described use case then work out how to
> specify what we want.
Well, one of the simplest cases where quorum commit and this
feature would be useful is the following, with 2 data centers:
- on center 1, master A and standby B
- on center 2, standby C and standby D
With the current synchronous_standby_names, what we can do now is
ensure that one node has acknowledged the commit of the master. For
example synchronous_standby_names = 'B,C,D'. But you know that :)
What this feature would allow us to do is, for example, ensure
that a node on data center 2 has acknowledged the commit of the
master, meaning that even if data center 1 is completely lost for
one reason or another, we have at least one node on center 2 that has
lost no data at transaction commit.
Now, regarding the way to express that, we need to use a concept of
node group for each element of synchronous_standby_names. A group
contains a set of elements, each element being a group or a single
node. And for each group we need to know three things when a commit
needs to be acknowledged:
- Does my group need to acknowledge the commit?
- If yes, how many elements in my group need to acknowledge it?
- Does the order of my elements matter?
That's where the micro-language idea makes sense to use. For example,
we can define a group using separators like (elt1,...,eltN) or
[elt1,elt2,...,eltN]. Appending a number in front of a group is essential
as well for quorum commits. Hence for example, assuming that '()' is
used for a group whose element order does not matter, if we use that:
- k(elt1,elt2,...,eltN) means that we need k elements in the set
to return true (aka commit confirmation).
- k[elt1,elt2,...,eltN] means that we need the first k elements in the
set to return true.
When k is not defined for a group, k = 1. Using only elements
separated by commas for the upper group means that we wait for the
first element in the set (for backward compatibility), hence:
1(elt1,elt2,eltN) <=> elt1,elt2,eltN
We could as well mix each behavior, aka being able to define for a
group to wait for the first k elements and a total of j elements in
the whole set, but I don't think that we need to go that far. I
suspect that in most cases users will be satisfied with only cases
where there is a group of data centers, and they want to be sure that
one or two nodes in each center have acknowledged a commit to the master
(performance is not the issue here if centers are not close). Hence
in the case above, you could get the behavior wanted with this
definition:
2(B,(C,D))
With more data centers, like 3 (wait for two nodes in the 3rd set):
3(B,(C,D),2(E,F,G))
Users could define more levels of group, like that:
2(A,(B,(C,D)))
But that's actually something few people would do in real cases.
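To make the semantics above concrete, here is a rough, illustrative sketch (not from any posted patch) of how such a micro-language could be parsed and evaluated. The token and function names are invented, a bare top-level list is treated as 1[...] for backward compatibility as described, and standby liveness and promotion of potential standbys are deliberately ignored:

```python
import re

# Proposed grammar, as discussed in this thread:
#   k(e1,...,eN)  -> any k elements must acknowledge (quorum; order ignored)
#   k[e1,...,eN]  -> the first k listed elements must acknowledge (priority)
# When k is omitted for a group, k = 1.

def tokenize(spec):
    # Node names, quorum counts, and the structural characters ()[],
    return re.findall(r"[A-Za-z_]\w*|\d+|[()\[\],]", spec)

def parse(spec):
    tokens = tokenize(spec)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_element():
        nonlocal pos
        tok = peek()
        if tok.isdigit():            # quorum count prefixing a group
            pos += 1
            return parse_group(int(tok))
        if tok in "([":              # group without a count: k = 1
            return parse_group(1)
        pos += 1                     # plain node name
        return ("node", tok)

    def parse_group(k):
        nonlocal pos
        kind = "quorum" if tokens[pos] == "(" else "priority"
        pos += 1                     # consume '(' or '['
        elems = [parse_element()]
        while peek() == ",":
            pos += 1
            elems.append(parse_element())
        pos += 1                     # consume ')' or ']'
        return (kind, k, elems)

    elems = [parse_element()]
    while peek() == ",":
        pos += 1
        elems.append(parse_element())
    if len(elems) == 1 and elems[0][0] != "node":
        return elems[0]
    # Bare comma list: wait for the first element (backward compatibility)
    return ("priority", 1, elems)

def satisfied(node, acked):
    """Has this element collected enough acknowledgements?"""
    if node[0] == "node":
        return node[1] in acked
    kind, k, elems = node
    if kind == "quorum":
        return sum(satisfied(e, acked) for e in elems) >= k
    return all(satisfied(e, acked) for e in elems[:k])
```

With the two-data-center case above, `satisfied(parse('2(B,(C,D))'), {'B', 'C'})` holds, while acknowledgements from C and D alone are not enough, since B itself must also acknowledge.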
> I'm nervous of "it would be good ifs" because we do a ton of work only to
> find a design flaw.
That makes sense. Let's continue arguing on it then.
--
Michael
From: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 05:59:52 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi,
On 2015-06-26 AM 12:49, Sawada Masahiko wrote:
> On Thu, Jun 25, 2015 at 7:32 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>
>> Let's start with a complex, fully described use case then work out how to
>> specify what we want.
>>
>> I'm nervous of "it would be good ifs" because we do a ton of work only to
>> find a design flaw.
>>
>
> I'm not sure specific implementation yet, but I came up with solution
> for this case.
>
> For example,
> - s_s_name = '1(a, b), c, d'
> The priority of both 'a' and 'b' are 1, and 'c' is 2, 'd' is 3.
> i.g, 'b' and 'c' are potential sync node, and the quorum commit is
> enable only between 'a' and 'b'.
>
> - s_s_name = 'a, 1(b,c), d'
> priority of 'a' is 1, 'b' and 'c' are 2, 'd' is 3.
> So the quorum commit with 'b' and 'c' will be enabled after 'a' down.
>
Do we really need to add a number like '1' in '1(a, b), c, d'?
The order of writing names already implies priorities like 2 & 3 for c & d,
respectively, like in your example. Having to write '1' for the group '(a, b)'
seems unnecessary, IMHO. Sorry if I have missed any previous discussion where
its necessity was discussed.
So, the order of writing standby names in the list should declare their
relative priorities and parentheses (possibly nested) should help inform about
the grouping (for quorum?)
Thanks,
Amit
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
Cc: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 06:05:52 |
Message-ID: | CAB7nPqT-Z5SEom2KDwK9Ja3qXmq26=TbX+AnW0L_n0FCXq-xzA@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jun 26, 2015 at 2:59 PM, Amit Langote wrote:
> Do we really need to add a number like '1' in '1(a, b), c, d'?
> The order of writing names already implies priorities like 2 & 3 for c & d,
> respectively, like in your example. Having to write '1' for the group '(a, b)'
> seems unnecessary, IMHO. Sorry if I have missed any previous discussion where
> its necessity was discussed.
'1' is implied if no number is specified. That's the idea as written
here, not something decided of course :)
> So, the order of writing standby names in the list should declare their
> relative priorities and parentheses (possibly nested) should help inform about
> the grouping (for quorum?)
Yes.
--
Michael
From: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 06:06:24 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 2015-06-26 PM 02:59, Amit Langote wrote:
> On 2015-06-26 AM 12:49, Sawada Masahiko wrote:
>>
>> For example,
>> - s_s_name = '1(a, b), c, d'
>> The priority of both 'a' and 'b' are 1, and 'c' is 2, 'd' is 3.
>> i.g, 'b' and 'c' are potential sync node, and the quorum commit is
>> enable only between 'a' and 'b'.
>>
>> - s_s_name = 'a, 1(b,c), d'
>> priority of 'a' is 1, 'b' and 'c' are 2, 'd' is 3.
>> So the quorum commit with 'b' and 'c' will be enabled after 'a' down.
>>
>
> Do we really need to add a number like '1' in '1(a, b), c, d'?
>
> The order of writing names already implies priorities like 2 & 3 for c & d,
> respectively, like in your example. Having to write '1' for the group '(a, b)'
> seems unnecessary, IMHO. Sorry if I have missed any previous discussion where
> its necessity was discussed.
>
> So, the order of writing standby names in the list should declare their
> relative priorities and parentheses (possibly nested) should help inform about
> the grouping (for quorum?)
>
Oh, I missed Michael's latest message that describes its necessity. So, the
number is essentially the quorum for a group.
Sorry about the noise.
Thanks,
Amit
From: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 08:04:43 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi,
On 2015-06-25 PM 01:01, Michael Paquier wrote:
> On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
>> On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>>> and that's actually equivalent to that in
>>> the grammar: 1(AAA,BBB,CCC).
>>
>> I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
>> two servers AAA and BBB are running, the master server may return a success
>> of the transaction to the client just after it receives the ACK from BBB.
>> OTOH, in the case of AAA,BBB, that never happens. The master must wait for
>> the ACK from AAA to arrive before completing the transaction. And then,
>> if AAA goes down, BBB should become synchronous standby.
>
> Ah. Right. I missed your point, that's a bad day... We could have
> multiple separators to define group types then:
> - "()" where the order of acknowledgement does not matter
> - "[]" where it does not.
For '[]', I guess you meant "where it does."
> You would find the old grammar with:
> 1[AAA,BBB,CCC]
>
Thanks,
Amit
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 08:18:38 |
Message-ID: | CAB7nPqQe1oEsY=Yus0iCyhNPQTmYUL-XDe-GyEfknQoc27D93w@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jun 26, 2015 at 5:04 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
> Hi,
>
> On 2015-06-25 PM 01:01, Michael Paquier wrote:
>> On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
>>> On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>>>> and that's actually equivalent to that in
>>>> the grammar: 1(AAA,BBB,CCC).
>>>
>>> I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
>>> two servers AAA and BBB are running, the master server may return a success
>>> of the transaction to the client just after it receives the ACK from BBB.
>>> OTOH, in the case of AAA,BBB, that never happens. The master must wait for
>>> the ACK from AAA to arrive before completing the transaction. And then,
>>> if AAA goes down, BBB should become synchronous standby.
>>
>> Ah. Right. I missed your point, that's a bad day... We could have
>> multiple separators to define group types then:
>> - "()" where the order of acknowledgement does not matter
>> - "[]" where it does not.
>
> For '[]', I guess you meant "where it does."
Yes, thanks :p
--
Michael
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 16:42:22 |
Message-ID: | CA+TgmoaLjY7Orh3PpOZCM4NLFtzxc-eaR98XEbGwcMYK1ZWkdg@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jun 26, 2015 at 1:46 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote:
>> Let's start with a complex, fully described use case then work out how to
>> specify what we want.
>
> Well, one of the most simple cases where quorum commit and this
> feature would be useful for is that, with 2 data centers:
> - on center 1, master A and standby B
> - on center 2, standby C and standby D
> With the current synchronous_standby_names, what we can do now is
> ensuring that one node has acknowledged the commit of master. For
> example synchronous_standby_names = 'B,C,D'. But you know that :)
> What this feature would allow use to do is for example being able to
> ensure that a node on the data center 2 has acknowledged the commit of
> master, meaning that even if data center 1 completely lost for a
> reason or another we have at least one node on center 2 that has lost
> no data at transaction commit.
>
> Now, regarding the way to express that, we need to use a concept of
> node group for each element of synchronous_standby_names. A group
> contains a set of elements, each element being a group or a single
> node. And for each group we need to know three things when a commit
> needs to be acknowledged:
> - Does my group need to acknowledge the commit?
> - If yes, how many elements in my group need to acknowledge it?
> - Does the order of my elements matter?
>
> That's where the micro-language idea makes sense to use. For example,
> we can define a group using separators and like (elt1,...eltN) or
> [elt1,elt2,eltN]. Appending a number in front of a group is essential
> as well for quorum commits. Hence for example, assuming that '()' is
> used for a group whose element order does not matter, if we use that:
> - k(elt1,elt2,eltN) means that we need for the k elements in the set
> to return true (aka commit confirmation).
> - k[elt1,elt2,eltN] means that we need for the first k elements in the
> set to return true.
>
> When k is not defined for a group, k = 1. Using only elements
> separated by commas for the upper group means that we wait for the
> first element in the set (for backward compatibility), hence:
> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN
Nice design.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 17:12:22 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 06/26/2015 09:42 AM, Robert Haas wrote:
> On Fri, Jun 26, 2015 at 1:46 AM, Michael Paquier
>> That's where the micro-language idea makes sense to use. For example,
>> we can define a group using separators and like (elt1,...eltN) or
>> [elt1,elt2,eltN]. Appending a number in front of a group is essential
>> as well for quorum commits. Hence for example, assuming that '()' is
>> used for a group whose element order does not matter, if we use that:
>> - k(elt1,elt2,eltN) means that we need for the k elements in the set
>> to return true (aka commit confirmation).
>> - k[elt1,elt2,eltN] means that we need for the first k elements in the
>> set to return true.
>>
>> When k is not defined for a group, k = 1. Using only elements
>> separated by commas for the upper group means that we wait for the
>> first element in the set (for backward compatibility), hence:
>> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN
This really feels like we're going way beyond what we want in a single
string GUC. I feel that this feature, as outlined, is a terrible hack
which we will regret supporting in the future. You're taking something
which was already a fast hack because we weren't sure if anyone would
use it, and building two levels on top of that.
If we're going to do quorum, multi-set synchrep, then we need to have a
real management interface. Like, we really ought to have a system
catalog and some built in functions to manage this instead, e.g.
pg_add_synch_set(set_name NAME, quorum INT, set_members VARIADIC)
pg_add_synch_set('bolivia', 1, 'bsrv-2','bsrv-3','bsrv-5')
pg_modify_sync_set(quorum INT, set_members VARIADIC)
pg_drop_synch_set(set_name NAME)
For users who want the new functionality, they just set
synchronous_standby_names='catalog' in pg.conf.
Having a function interface for this would make it worlds easier for the
DBA to reconfigure in order to accommodate network changes as well.
Let's face it, a DBA with three synch sets in different geos is NOT
going to want to edit pg.conf by hand and reload when the link to Brazil
goes down. That's a really sucky workflow, and near-impossible to automate.
We'll also want a new system view, pg_stat_synchrep:
pg_stat_synchrep
standby_name
client_addr
replication_status
synch_set
synch_quorum
synch_status
Alternately, we could overload those columns onto pg_stat_replication,
but that seems messy.
Finally, while I'm raining on everyone's parade: the mechanism of
identifying synchronous replicas by setting the application_name on the
replica is confusing and error-prone; if we're building out synchronous
replication into a sophisticated system, we ought to think about
replacing it.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 18:32:35 |
Message-ID: | CA+TgmoaLG6pqGUiUgaidgNhGDSKRyHieBg5A60wiV0EguqAW1g@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jun 26, 2015 at 1:12 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> This really feels like we're going way beyond what we want a single
> string GUC. I feel that this feature, as outlined, is a terrible hack
> which we will regret supporting in the future. You're taking something
> which was already a fast hack because we weren't sure if anyone would
> use it, and building two levels on top of that.
>
> If we're going to do quorum, multi-set synchrep, then we need to have a
> real management interface. Like, we really ought to have a system
> catalog and some built in functions to manage this instead, e.g.
>
> pg_add_synch_set(set_name NAME, quorum INT, set_members VARIADIC)
>
> pg_add_synch_set('bolivia', 1, 'bsrv-2,'bsrv-3','bsrv-5')
>
> pg_modify_sync_set(quorum INT, set_members VARIADIC)
>
> pg_drop_synch_set(set_name NAME)
>
> For users who want the new functionality, they just set
> synchronous_standby_names='catalog' in pg.conf.
>
> Having a function interface for this would make it worlds easier for the
> DBA to reconfigure in order to accomodate network changes as well.
> Let's face it, a DBA with three synch sets in different geos is NOT
> going to want to edit pg.conf by hand and reload when the link to Brazil
> goes down. That's a really sucky workflow, and near-impossible to automate.
I think your proposal is worth considering, but you would need to fill
in a lot more details and explain how it works in detail, rather than
just via a set of example function calls. The GUC-based syntax
proposal covers cases like multi-level rules and, now, prioritization,
and it's not clear how those would be reflected in what you propose.
> Finally, while I'm raining on everyone's parade: the mechanism of
> identifying synchronous replicas by setting the application_name on the
> replica is confusing and error-prone; if we're building out synchronous
> replication into a sophisticated system, we ought to think about
> replacing it.
I'm not averse to replacing it with something we all agree is better.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-26 18:53:43 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 06/26/2015 11:32 AM, Robert Haas wrote:
> I think your proposal is worth considering, but you would need to fill
> in a lot more details and explain how it works in detail, rather than
> just via a set of example function calls. The GUC-based syntax
> proposal covers cases like multi-level rules and, now, prioritization,
> and it's not clear how those would be reflected in what you propose.
So what I'm seeing from the current proposal is:
1. we have several defined synchronous sets
2. each set requires a quorum of k (defined per set)
3. within each set, replicas are arranged in priority order.
One thing which the proposal does not implement is *names* for
synchronous sets. I would also suggest that if I lose this battle and
we decide to go with a single stringy GUC, that we at least use JSON
instead of defining our own, proprietary, syntax?
Point 3. also seems kind of vaguely defined. Are we still relying on
the idea that multiple servers have the same application_name to make
them equal, and that anything else is a prioritization? That is, if we have:
replica1: appname=group1
replica2: appname=group2
replica3: appname=group1
replica4: appname=group2
replica5: appname=group1
replica6: appname=group2
And the definition:
synchset: A
quorum: 2
members: [ group1, group2 ]
Then the desired behavior would be: we must get acks from at least 2
servers in group1, but if group1 isn't responding, then from group2?
What if *one* server in group1 responds? What do we do? Do we fail the
whole group and try for 2 out of 3 in group2? Or do we only need one in
group2? In which case, what prioritization is there? Who could
possibly use anything so complex?
I'm personally not convinced that quorum and prioritization are
compatible. I suggest instead that quorum and prioritization should be
exclusive alternatives, that is that a synch set should be either a
quorum set (with all members as equals) or a prioritization set (if rep1
fails, try rep2). I can imagine use cases for either mode, but not one
which would involve doing both together.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-28 08:52:36 |
Message-ID: | CAD21AoB2=UxR98-p_j2vghPczgsBpEXpZvW5AKp+x2zQXzzU0Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jun 26, 2015 at 2:46 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote:
>> Let's start with a complex, fully described use case then work out how to
>> specify what we want.
>
> Well, one of the most simple cases where quorum commit and this
> feature would be useful for is that, with 2 data centers:
> - on center 1, master A and standby B
> - on center 2, standby C and standby D
> With the current synchronous_standby_names, what we can do now is
> ensuring that one node has acknowledged the commit of master. For
> example synchronous_standby_names = 'B,C,D'. But you know that :)
> What this feature would allow use to do is for example being able to
> ensure that a node on the data center 2 has acknowledged the commit of
> master, meaning that even if data center 1 completely lost for a
> reason or another we have at least one node on center 2 that has lost
> no data at transaction commit.
>
> Now, regarding the way to express that, we need to use a concept of
> node group for each element of synchronous_standby_names. A group
> contains a set of elements, each element being a group or a single
> node. And for each group we need to know three things when a commit
> needs to be acknowledged:
> - Does my group need to acknowledge the commit?
> - If yes, how many elements in my group need to acknowledge it?
> - Does the order of my elements matter?
>
> That's where the micro-language idea makes sense to use. For example,
> we can define a group using separators and like (elt1,...eltN) or
> [elt1,elt2,eltN]. Appending a number in front of a group is essential
> as well for quorum commits. Hence for example, assuming that '()' is
> used for a group whose element order does not matter, if we use that:
> - k(elt1,elt2,eltN) means that we need for the k elements in the set
> to return true (aka commit confirmation).
> - k[elt1,elt2,eltN] means that we need for the first k elements in the
> set to return true.
>
> When k is not defined for a group, k = 1. Using only elements
> separated by commas for the upper group means that we wait for the
> first element in the set (for backward compatibility), hence:
> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN
>
I think that you meant "1[elt1,elt2,eltN] <=> elt1,elt2,eltN" in this
case (for backward compatibility), right?
Regards,
--
Sawada Masahiko
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-28 10:30:48 |
Message-ID: | CAB7nPqR_ahU+f-GCnxGDrmwSoHGcwoTjk++mUBpGba+YiG-qDg@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Jun 28, 2015 at 5:52 PM, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Fri, Jun 26, 2015 at 2:46 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote:
>>> Let's start with a complex, fully described use case then work out how to
>>> specify what we want.
>>
>> Well, one of the most simple cases where quorum commit and this
>> feature would be useful for is that, with 2 data centers:
>> - on center 1, master A and standby B
>> - on center 2, standby C and standby D
>> With the current synchronous_standby_names, what we can do now is
>> ensuring that one node has acknowledged the commit of master. For
>> example synchronous_standby_names = 'B,C,D'. But you know that :)
>> What this feature would allow use to do is for example being able to
>> ensure that a node on the data center 2 has acknowledged the commit of
>> master, meaning that even if data center 1 completely lost for a
>> reason or another we have at least one node on center 2 that has lost
>> no data at transaction commit.
>>
>> Now, regarding the way to express that, we need to use a concept of
>> node group for each element of synchronous_standby_names. A group
>> contains a set of elements, each element being a group or a single
>> node. And for each group we need to know three things when a commit
>> needs to be acknowledged:
>> - Does my group need to acknowledge the commit?
>> - If yes, how many elements in my group need to acknowledge it?
>> - Does the order of my elements matter?
>>
>> That's where the micro-language idea makes sense to use. For example,
>> we can define a group using separators and like (elt1,...eltN) or
>> [elt1,elt2,eltN]. Appending a number in front of a group is essential
>> as well for quorum commits. Hence for example, assuming that '()' is
>> used for a group whose element order does not matter, if we use that:
>> - k(elt1,elt2,eltN) means that we need for the k elements in the set
>> to return true (aka commit confirmation).
>> - k[elt1,elt2,eltN] means that we need for the first k elements in the
>> set to return true.
>>
>> When k is not defined for a group, k = 1. Using only elements
>> separated by commas for the upper group means that we wait for the
>> first element in the set (for backward compatibility), hence:
>> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN
>>
>
> I think that you meant "1[elt1,elt2,eltN] <=> elt1,elt2,eltN" in this
> case (for backward compatibility), right?
Yes, [] is where the order of items matters. Thanks for the correction.
Still, we could do the opposite; nothing is decided here.
--
Michael
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-28 11:36:29 |
Message-ID: | CAD21AoAGubHmH5dSE8waYjjYeo+hpZZ1iwenS1D9n8du3xHD3w@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 06/26/2015 11:32 AM, Robert Haas wrote:
>> I think your proposal is worth considering, but you would need to fill
>> in a lot more details and explain how it works in detail, rather than
>> just via a set of example function calls. The GUC-based syntax
>> proposal covers cases like multi-level rules and, now, prioritization,
>> and it's not clear how those would be reflected in what you propose.
>
> So what I'm seeing from the current proposal is:
>
> 1. we have several defined synchronous sets
> 2. each set requires a quorum of k (defined per set)
> 3. within each set, replicas are arranged in priority order.
>
> One thing which the proposal does not implement is *names* for
> synchronous sets. I would also suggest that if I lose this battle and
> we decide to go with a single stringy GUC, that we at least use JSON
> instead of defining our out, proprietary, syntax?
JSON would be more flexible for defining synchronous sets, but it would
require changing how the configuration file is parsed so that a value
can contain newlines.
> Point 3. also seems kind of vaguely defined. Are we still relying on
> the idea that multiple servers have the same application_name to make
> them equal, and that anything else is a proritization? That is, if we have:
Yep, I guess that servers with the same application name have the same
priority, and the servers in the same set have the same priority.
(Here a 'set' means a group of application names in the GUC.)
> replica1: appname=group1
> replica2: appname=group2
> replica3: appname=group1
> replica4: appname=group2
> replica5: appname=group1
> replica6: appname=group2
>
> And the definition:
>
> synchset: A
> quorum: 2
> members: [ group1, group2 ]
>
> Then the desired behavior would be: we must get acks from at least 2
> servers in group1, but if group1 isn't responding, then from group2?
In this case, if we want to use quorum commit (i.e., all replicas have
the same priority),
I guess that we must get an ack from 2 *elements* in the list (both
group1 and group2).
If quorum = 1, we must get an ack from either group1 or group2.
> What if *one* server in group1 responds? What do we do? Do we fail the
> whole group and try for 2 out of 3 in group2? Or do we only need one in
> group2? In which case, what prioritization is there? Who could
> possibly use anything so complex?
If several servers have the same application name, the master will get
a different ack (write and flush LSN) from each of them. For safety, we
can use the lowest LSN of them to release backend waiters.
But if only one server in group1 returns an ack to the master, and the
other two servers are not working,
I guess the master can use that ack, because the other servers are invalid.
That is, we must get at least one ack from each of group1 and group2.
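The lowest-LSN rule described above can be sketched as follows (illustrative Python, not PostgreSQL internals; `groups` mapping each required group to the flush LSNs of its responding members is a hypothetical structure):

```python
def group_lsn(member_lsns):
    """Conservative LSN for a group of standbys sharing one
    application_name: the slowest responding member's flush LSN."""
    return min(member_lsns)

def commit_releasable(commit_lsn, groups):
    """groups: dict of group name -> list of flush LSNs reported by
    that group's responding members.  Release the waiting backend
    only when every required group's LSN has reached the commit
    record's LSN."""
    return all(group_lsn(lsns) >= commit_lsn for lsns in groups.values())
```

Non-responding members simply contribute no LSN, which matches the idea above that a lone surviving member of group1 can still satisfy that group.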
> I'm personally not convinced that quorum and prioritization are
> compatible. I suggest instead that quorum and prioritization should be
> exclusive alternatives, that is that a synch set should be either a
> quorum set (with all members as equals) or a prioritization set (if rep1
> fails, try rep2). I can imagine use cases for either mode, but not one
> which would involve doing both together.
>
Yep, separating the GUC parameter between prioritization and quorum
could also be a good idea.
Also I think that we must make it possible to decide which server
should be promoted when the master server goes down.
Regards,
--
Sawada Masahiko
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-28 12:11:38 |
Message-ID: | CAB7nPqSaucq8mMsrEh7kn7Ha4q5XKvBzmb-+TLFqKNaOoAO77w@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Jun 27, 2015 at 2:12 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Finally, while I'm raining on everyone's parade: the mechanism of
> identifying synchronous replicas by setting the application_name on the
> replica is confusing and error-prone; if we're building out synchronous
> replication into a sophisticated system, we ought to think about
> replacing it.
I assume that you do not refer to a new parameter in the connection
string like node_name, no? Are you referring to an extension of
START_REPLICATION in the replication protocol to pass an ID?
--
Michael
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-28 19:20:05 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 06/28/2015 04:36 AM, Sawada Masahiko wrote:
> On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 06/26/2015 11:32 AM, Robert Haas wrote:
>>> I think your proposal is worth considering, but you would need to fill
>>> in a lot more details and explain how it works in detail, rather than
>>> just via a set of example function calls. The GUC-based syntax
>>> proposal covers cases like multi-level rules and, now, prioritization,
>>> and it's not clear how those would be reflected in what you propose.
>>
>> So what I'm seeing from the current proposal is:
>>
>> 1. we have several defined synchronous sets
>> 2. each set requires a quorum of k (defined per set)
>> 3. within each set, replicas are arranged in priority order.
>>
>> One thing which the proposal does not implement is *names* for
>> synchronous sets. I would also suggest that if I lose this battle and
>> we decide to go with a single stringy GUC, that we at least use JSON
>> instead of defining our out, proprietary, syntax?
>
> JSON would be more flexible for defining synchronous sets, but it would
> require changing how the configuration file is parsed so that a value
> can contain newlines.
Right. Well, another reason we should be using a system catalog and not
a single GUC ...
> In this case, if we want to use quorum commit (i.e., all replicas have
> the same priority),
> I guess that we must get an ack from 2 *elements* in the list (both
> group1 and group2).
> If quorum = 1, we must get an ack from either group1 or group2.
In that case, then priority among quorum groups is pretty meaningless,
isn't it?
>> I'm personally not convinced that quorum and prioritization are
>> compatible. I suggest instead that quorum and prioritization should be
>> exclusive alternatives, that is that a synch set should be either a
>> quorum set (with all members as equals) or a prioritization set (if rep1
>> fails, try rep2). I can imagine use cases for either mode, but not one
>> which would involve doing both together.
>>
>
> Yep, separating the GUC parameter between prioritization and quorum
> could also be a good idea.
We're agreed, then ...
> Also I think that we must make it possible to decide which server
> should be promoted when the master server goes down.
Yes, and probably my biggest issue with this patch is that it makes
deciding which server to fail over to *more* difficult (by adding more
synchronous options) without giving the DBA any more tools to decide how
to fail over. Aside from "because we said we'd eventually do it", what
real-world problem are we solving with this patch?
I'm serious. Only if we define the real reliability/availability
problem we want to solve can we decide if the new feature solves it.
I've seen a lot of technical discussion about the syntax for the
proposed GUC, and zilch about what's going to happen when the master
fails, or who the target audience for this feature is.
On 06/28/2015 05:11 AM, Michael Paquier wrote:
> On Sat, Jun 27, 2015 at 2:12 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Finally, while I'm raining on everyone's parade: the mechanism of
>> identifying synchronous replicas by setting the application_name on the
>> replica is confusing and error-prone; if we're building out synchronous
>> replication into a sophisticated system, we ought to think about
>> replacing it.
>
> I assume that you do not refer to a new parameter in the connection
> string like node_name, no? Are you referring to an extension of
> START_REPLICATION in the replication protocol to pass an ID?
Well, if I had my druthers, we'd have a way to map client_addr (or
replica IDs, which would be better, in case of network proxying) *on the
master* to synchronous standby roles. Synch roles should be defined on
the master, not on the replica, because it's the master which is going
to stop accepting writes if they've been defined incorrectly.
It's always been a problem that one can accomplish a de-facto
denial-of-service by joining a cluster using the same application_name
as the synch standby, more so because it's far too easy to do that
accidentally. One simply needs to make the mistake of copying
recovery.conf from the synch replica instead of the async replica, and
you've created a reliability problem.
Also, the fact that we use application_name for synch_standby groups
prevents us from giving the standbys in the group their own names for
identification purposes. It's only the fact that synchronous groups are
relatively useless in the current feature set that's prevented this from
being a real operational problem; if we implement quorum commit, then
users are going to want to use groups more often and will want to
identify the members of the group, and not just by IP address.
We *really* should have discussed this feature at PGCon.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-29 08:01:10 |
Message-ID: | CAB7nPqQdS7wmPVXqJxF7ZgTM0L-mxM0-ohadL7=e0+UjjpsJGw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 06/28/2015 04:36 AM, Sawada Masahiko wrote:
>> On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> On 06/26/2015 11:32 AM, Robert Haas wrote:
>>>> I think your proposal is worth considering, but you would need to fill
>>>> in a lot more details and explain how it works in detail, rather than
>>>> just via a set of example function calls. The GUC-based syntax
>>>> proposal covers cases like multi-level rules and, now, prioritization,
>>>> and it's not clear how those would be reflected in what you propose.
>>>
>>> So what I'm seeing from the current proposal is:
>>>
>>> 1. we have several defined synchronous sets
>>> 2. each set requires a quorum of k (defined per set)
>>> 3. within each set, replicas are arranged in priority order.
>>>
>>> One thing which the proposal does not implement is *names* for
>>> synchronous sets. I would also suggest that if I lose this battle and
>>> we decide to go with a single stringy GUC, that we at least use JSON
>>> instead of defining our out, proprietary, syntax?
>>
>> JSON would be more flexible for defining synchronous sets, but it would
>> require changing how the configuration file is parsed so that a value
>> can contain newlines.
>
> Right. Well, another reason we should be using a system catalog and not
> a single GUC ...
I assume that this takes into account the fact that you will still
need a SIGHUP to properly reload the new node information from those
catalogs and to track whether some information has been modified.
And a connection to those catalogs will be needed as well, something
that we don't have now. Another barrier to the catalog approach is
that catalogs get replicated to the standbys, and I think that we want
to avoid that. But perhaps you simply meant having an SQL interface
with some metadata, right? Perhaps I got confused by the word
'catalog'.
>>> I'm personally not convinced that quorum and prioritization are
>>> compatible. I suggest instead that quorum and prioritization should be
>>> exclusive alternatives, that is that a synch set should be either a
>>> quorum set (with all members as equals) or a prioritization set (if rep1
>>> fails, try rep2). I can imagine use cases for either mode, but not one
>>> which would involve doing both together.
>>>
>>
>> Yep, separating the GUC parameter between prioritization and quorum
>> could also be a good idea.
>
> We're agreed, then ...
Er, I disagree here. Being able to get prioritization and quorum
working together is a requirement of this feature in my opinion. Using
again the example above with 2 data centers, being able to define a
prioritization set on the set of nodes of data center 1, and a quorum
set in data center 2 would reduce failure probability by being able to
prevent problems where for example one or more nodes lag behind
(improving performance at the same time).
>> Also I think that we must make it possible to decide which server
>> should be promoted when the master server goes down.
>
> Yes, and probably my biggest issue with this patch is that it makes
> deciding which server to fail over to *more* difficult (by adding more
> synchronous options) without giving the DBA any more tools to decide how
> to fail over. Aside from "because we said we'd eventually do it", what
> real-world problem are we solving with this patch?
Hm. This patch needs to be coupled with improvements to
pg_stat_replication so that it can represent a node tree, basically by
adding the group to which a node is assigned. I can draft that if
needed, I am just a bit too lazy now...
Honestly, this is not a matter of tooling. Even today, if a DBA wants
to change s_s_names without touching postgresql.conf, they can just
run ALTER SYSTEM and then reload parameters.
> It's always been a problem that one can accomplish a de-facto
> denial-of-service by joining a cluster using the same application_name
> as the synch standby, moreso because it's far too easy to do that
> accidentally. One needs to simply make the mistake of copying
> recovery.conf from the synch replica instead of the async replica, and
> you've created a reliability problem.
That's a scripting problem then. There are many ways to make a mistake
in this area when setting up a standby. The application_name value is
one; you can do worse by pointing to an incorrect IP, missing a
firewall filter, or pointing to an incorrect port.
> Also, the fact that we use application_name for synch_standby groups
> prevents us from giving the standbys in the group their own names for
> identification purposes. It's only the fact that synchronous groups are
> relatively useless in the current feature set that's prevented this from
> being a real operational problem; if we implement quorum commit, then
> users are going to want to use groups more often and will want to
> identify the members of the group, and not just by IP address.
Managing groups in the synchronous protocol adds one level of
complexity for the operator, while what I had in mind first was to
allow a user to pass the server a formula that decides whether
synchronous_commit is satisfied or not. In any case this feels like a
different feature, thinking of it now.
> We *really* should have discussed this feature at PGCon.
What is done is done. Sawada-san and I met last weekend, and we agreed
to get a clear image of a spec for this feature on this thread before
doing any coding. So let's continue the discussion.
--
Michael
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-06-29 17:40:56 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 06/29/2015 01:01 AM, Michael Paquier wrote:
> On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Right. Well, another reason we should be using a system catalog and not
>> a single GUC ...
>
> I assume that this takes into account the fact that you will still
> need a SIGHUP to reload properly the new node information from those
> catalogs and to track if some information has been modified or not.
Well, my hope was NOT to need a sighup, which is something I see as a
failing of the current system.
> And the fact that a connection to those catalogs will be needed as
> well, something that we don't have now.
Hmmm? I was envisioning the catalog being used as one on the master.
Why do we need an additional connection for that? Don't we already need
a connection in order to update pg_stat_replication?
> Another barrier to the catalog
> approach is that catalogs get replicated to the standbys, and I think
> that we want to avoid that.
Yeah, it occurred to me that that approach has its downside as well as
an upside. For example, you wouldn't want a failed-over new master to
synchrep to itself. Mostly, I was looking for something reactive,
relational, and validated, instead of passing an unvalidated string to
pg.conf and hoping that it's accepted on reload. Also some kind of
catalog approach would permit incremental changes to the config instead
of wholesale replacement.
> But perhaps you simply meant having an SQL
> interface with some metadata, right? Perhaps I got confused by the
> word 'catalog'.
No, that doesn't make any sense.
>>>> I'm personally not convinced that quorum and prioritization are
>>>> compatible. I suggest instead that quorum and prioritization should be
>>>> exclusive alternatives, that is that a synch set should be either a
>>>> quorum set (with all members as equals) or a prioritization set (if rep1
>>>> fails, try rep2). I can imagine use cases for either mode, but not one
>>>> which would involve doing both together.
>>>>
>>>
>>> Yep, separating the GUC parameter between prioritization and quorum
>>> could also be a good idea.
>>
>> We're agreed, then ...
>
> Er, I disagree here. Being able to get prioritization and quorum
> working together is a requirement of this feature in my opinion. Using
> again the example above with 2 data centers, being able to define a
> prioritization set on the set of nodes of data center 1, and a quorum
> set in data center 2 would reduce failure probability by being able to
> prevent problems where for example one or more nodes lag behind
> (improving performance at the same time).
Well, then *someone* needs to define the desired behavior for all
permutations of prioritized synch sets. If it's undefined, then we're
far worse off than we are now.
>>> Also I think that we must make it possible to decide which server
>>> should be promoted when the master server goes down.
>>
>> Yes, and probably my biggest issue with this patch is that it makes
>> deciding which server to fail over to *more* difficult (by adding more
>> synchronous options) without giving the DBA any more tools to decide how
>> to fail over. Aside from "because we said we'd eventually do it", what
>> real-world problem are we solving with this patch?
>
> Hm. This patch needs to be coupled with improvements to
> pg_stat_replication to be able to represent a node tree by basically
> adding to which group a node is assigned. I can draft that if needed,
> I am just a bit too lazy now...
>
> Honestly, this is not a matter of tooling. Even today if a DBA wants
> to change s_s_names without touching postgresql.conf you could just
> run ALTER SYSTEM and then reload parameters.
You're confusing two separate things. The primary manageability problem
has nothing to do with altering the parameter. The main problem is: if
there is more than one synch candidate, how do we determine *after the
master dies* which candidate replica was in synch at the time of
failure? Currently there is no way to do that. This proposal plans to,
effectively, add more synch candidate configurations without addressing
that core design failure *at all*. That's why I say that this patch
decreases overall reliability of the system instead of increasing it.
When I set up synch rep today, I never use more than two candidate synch
servers because of that very problem. And even with two I have to check
replay point because I have no way to tell which replica was in-sync at
the time of failure. Even in the current limited feature, this
significantly reduces the utility of synch rep. In your proposal, where
I could have multiple synch rep groups in multiple geos, how on Earth
could I figure out what to do when the master datacenter dies?
BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC
(assuming it's one parameter) instead of some custom syntax. If it's
JSON, we can validate it in psql, whereas if it's some custom syntax we
have to wait for the db to reload and fail to figure out that we forgot
a comma. Using JSON would also permit us to use jsonb_set and
jsonb_delete to incrementally change the configuration.
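The validate-before-reload argument can be illustrated with a sketch (Python stand-in for what client-side tooling could check before the value ever reaches postgresql.conf; the JSON schema with per-set `quorum` and `standbys` keys is hypothetical, not a settled format):

```python
import json

def validate_sync_config(text):
    """Client-side sanity check for a hypothetical JSON s_s_names
    value of the form {"<set>": {"quorum": k, "standbys": [...]}}.
    A malformed custom-syntax string would only fail at server
    reload; JSON can be rejected up front."""
    try:
        cfg = json.loads(text)
    except ValueError as e:
        return False, "not valid JSON: %s" % e
    if not isinstance(cfg, dict):
        return False, "top level must be an object of named sets"
    for name, grp in cfg.items():
        if not isinstance(grp, dict) or "standbys" not in grp:
            return False, "set %r has no standbys list" % name
        k = grp.get("quorum", 1)
        if not isinstance(k, int) or not 1 <= k <= len(grp["standbys"]):
            return False, "set %r: quorum out of range" % name
    return True, "ok"
```

A custom syntax would need a bespoke parser on the client to get the same guarantee.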
Question: what happens *today* if we have two different synch rep
strings in two different *.conf files? I wouldn't assume that anyone
has tested this ...
>> It's always been a problem that one can accomplish a de-facto
>> denial-of-service by joining a cluster using the same application_name
>> as the synch standby, moreso because it's far too easy to do that
>> accidentally. One needs to simply make the mistake of copying
>> recovery.conf from the synch replica instead of the async replica, and
>> you've created a reliability problem.
>
> That's a scripting problem then. There are many ways to do a false
> manipulation in this area when setting up a standby. application_name
> value is one, you can do worse by pointing to an incorrect IP as well,
> miss a firewall filter or point to an incorrect port.
You're missing the point. We've created something unmanageable because
we piggy-backed it onto features intended for something else entirely.
Now you're proposing to piggy-back additional features on top of the
already teetering Beijing-acrobat-stack of piggy-backs we already have.
I'm saying that if you want synch rep to actually be a sophisticated,
high-availability system, you need it to actually be high-availability,
not just pile on additional configuration options.
I'm in favor of a more robust and sophisticated synch rep. But not if
nobody not on this mailing list can configure it, and not if even we
don't know what it will do in an actual failure situation.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-01 14:15:19 |
Message-ID: | CAHGQGwE_-HCzw687B4SdMWqAkkPcu-uxmF3MKyDB9mu38cJ7Jg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 06/29/2015 01:01 AM, Michael Paquier wrote:
>> On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>>> Right. Well, another reason we should be using a system catalog and not
>>> a single GUC ...
The problem with using a system catalog to configure synchronous
replication is that even a configuration change needs to wait for its
WAL record (i.e., the one caused by the change to the system catalog)
to be replicated. Imagine the case where you have one synchronous
standby but it goes down. To keep the system up, you'd like to switch
the replication mode to asynchronous by changing the corresponding
system catalog. But that change may need to wait until the synchronous
standby starts up again and its WAL record is successfully replicated.
This means that you may need to wait forever...
One approach to address this problem is to introduce something like unlogged
system catalog. I'm not sure if that causes another big problem, though...
> You're confusing two separate things. The primary manageability problem
> has nothing to do with altering the parameter. The main problem is: if
> there is more than one synch candidate, how do we determine *after the
> master dies* which candidate replica was in synch at the time of
> failure? Currently there is no way to do that. This proposal plans to,
> effectively, add more synch candidate configurations without addressing
> that core design failure *at all*. That's why I say that this patch
> decreases overall reliability of the system instead of increasing it.
I agree this is a problem even today, but it's basically independent of
the proposed feature *itself*. So I think that it's better to discuss
and work on that problem separately. If so, we might be able to provide
a good way to find the new master even if the proposed feature finally
fails to be adopted.
Regards,
--
Fujii Masao
From: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-01 14:45:36 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 6/26/15 1:46 AM, Michael Paquier wrote:
> - k(elt1,elt2,eltN) means that we need for the k elements in the set
> to return true (aka commit confirmation).
> - k[elt1,elt2,eltN] means that we need for the first k elements in the
> set to return true.
I think the difference between (...) and [...] is not intuitive. To me,
{...} would be more intuitive to indicate order does not matter.
> When k is not defined for a group, k = 1.
How about putting it at the end? Like
[foo,bar,baz](2)
From: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-01 14:47:52 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 6/26/15 2:53 PM, Josh Berkus wrote:
> I would also suggest that if I lose this battle and
> we decide to go with a single stringy GUC, that we at least use JSON
> instead of defining our out, proprietary, syntax?
Does JSON have a natural syntax for a set without order?
From: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-01 14:50:20 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 7/1/15 10:15 AM, Fujii Masao wrote:
> One approach to address this problem is to introduce something like unlogged
> system catalog. I'm not sure if that causes another big problem, though...
Yeah, like the data disappearing after a crash. ;-)
From: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-01 14:55:28 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 6/26/15 1:12 PM, Josh Berkus wrote:
> If we're going to do quorum, multi-set synchrep, then we need to have a
> real management interface. Like, we really ought to have a system
> catalog and some built in functions to manage this instead, e.g.
>
> pg_add_synch_set(set_name NAME, quorum INT, set_members VARIADIC)
>
> pg_add_synch_set('bolivia', 1, 'bsrv-2','bsrv-3','bsrv-5')
>
> pg_modify_sync_set(quorum INT, set_members VARIADIC)
>
> pg_drop_synch_set(set_name NAME)
I respect that some people might like this, but I don't really see this
as an improvement. It's much easier for an administration person or
program to type out a list of standbys in a text file than having to go
through these interfaces that are non-idempotent, verbose, and only
available when the database server is up. The nice thing about a plain
and simple system is that you can build a complicated system on top of
it, if desired.
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-01 14:58:27 |
Message-ID: | CAD21AoBKAPZ1QNvhjfjBxqi56VrEYqgxrSS94jvU9x=U3BdotA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 06/29/2015 01:01 AM, Michael Paquier wrote:
>
> You're confusing two separate things. The primary manageability problem
> has nothing to do with altering the parameter. The main problem is: if
> there is more than one synch candidate, how do we determine *after the
> master dies* which candidate replica was in synch at the time of
> failure? Currently there is no way to do that. This proposal plans to,
> effectively, add more synch candidate configurations without addressing
> that core design failure *at all*. That's why I say that this patch
> decreases overall reliability of the system instead of increasing it.
>
> When I set up synch rep today, I never use more than two candidate synch
> servers because of that very problem. And even with two I have to check
> replay point because I have no way to tell which replica was in-sync at
> the time of failure. Even in the current limited feature, this
> significantly reduces the utility of synch rep. In your proposal, where
> I could have multiple synch rep groups in multiple geos, how on Earth
> could I figure out what to do when the master datacenter dies?
We can have standby servers with the same application_name today; that is
effectively a group.
So there are two problems regarding fail-over:
1. How can we know which group (set) we should use? (group means
application_name here)
2. And how can we decide which server of that group we should
promote to the next master server?
#1 is one of the big problems, I think.
I haven't come up with a correct solution yet, but we would need to know
which server (group) is the best for promoting
without the running old master server.
For example, improving the pg_stat_replication view, or having a mediation
process that always checks the progress of each standby.
#2, I guess the best solution is that the DBA can promote any server of the group.
That is, the DBA can always promote a server without considering the
state of the other servers in that group.
It's not difficult: always use the lowest LSN of a group as the group LSN.
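To make the "lowest LSN of a group as group LSN" rule concrete, here is a
minimal sketch (the LSN helper and sample values are illustrative
assumptions, not code from any patch in this thread):

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN string like '0/4000060' to an integer."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

# last-reported replay LSN of each member of one standby group (sample values)
group = {'c': '0/5000148', 'd': '0/4000060'}

# group LSN = the lowest LSN among members: everything up to this point
# is guaranteed to have been replicated to every member of the group
group_lsn = min(group.values(), key=lsn_to_int)
print(group_lsn)   # 0/4000060
```

With this rule, any member of the group can be promoted, since the group
LSN never overstates what the slowest member has received.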
>
> BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC
> (assuming it's one parameter) instead of some custom syntax. If it's
> JSON, we can validate it in psql, whereas if it's some custom syntax we
> have to wait for the db to reload and fail to figure out that we forgot
> a comma. Using JSON would also permit us to use jsonb_set and
> jsonb_delete to incrementally change the configuration.
Sounds convenient and flexible. I agree with this JSON-format
parameter only if we don't combine both quorum and prioritization,
because of backward compatibility.
I tend to prefer a JSON-format value as a new separate GUC parameter.
Anyway, if we use JSON, I'm imagining parameter values like below.
{
  "group1" : {
    "quorum" : 1,
    "standbys" : [
      {
        "a" : {
          "quorum" : 2,
          "standbys" : [ "c", "d" ]
        }
      },
      "b"
    ]
  }
}
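Read one way, the example above asks for 1 of { subgroup "a" (which itself
needs 2 of c and d), b }. A rough sketch of evaluating such a nested spec
follows; the function and its exact semantics are an illustrative
assumption, not part of any proposed patch:

```python
def quorum_satisfied(spec, acked):
    """Return True if the set of ACKed standby names satisfies the spec.

    spec is either a standby name (str) or a one-key dict of the form
    {"<group>": {"quorum": k, "standbys": [subspec, ...]}} as in the
    JSON example above; acked is the set of standby names that have
    acknowledged the commit.
    """
    if isinstance(spec, str):                 # leaf: a single standby
        return spec in acked
    (_, group), = spec.items()                # the single (sub)group entry
    k = group["quorum"]
    hits = sum(quorum_satisfied(s, acked) for s in group["standbys"])
    return hits >= k

spec = {"group1": {"quorum": 1,
                   "standbys": [{"a": {"quorum": 2,
                                       "standbys": ["c", "d"]}},
                                "b"]}}
print(quorum_satisfied(spec, {"b"}))        # True: b alone meets quorum 1
print(quorum_satisfied(spec, {"c"}))        # False: subgroup a needs c AND d
print(quorum_satisfied(spec, {"c", "d"}))   # True, via subgroup a
```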
> Question: what happens *today* if we have two different synch rep
> strings in two different *.conf files? I wouldn't assume that anyone
> has tested this ...
We use the last defined parameter even if sync rep strings appear in several files, right?
Regards,
--
Sawada Masahiko
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-01 18:21:47 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
All:
Replying to multiple people below.
On 07/01/2015 07:15 AM, Fujii Masao wrote:
> On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> You're confusing two separate things. The primary manageability problem
>> has nothing to do with altering the parameter. The main problem is: if
>> there is more than one synch candidate, how do we determine *after the
>> master dies* which candidate replica was in synch at the time of
>> failure? Currently there is no way to do that. This proposal plans to,
>> effectively, add more synch candidate configurations without addressing
>> that core design failure *at all*. That's why I say that this patch
>> decreases overall reliability of the system instead of increasing it.
>
> I agree this is a problem even today, but it's basically independent from
> the proposed feature *itself*. So I think that it's better to discuss and
> work on the problem separately. If so, we might be able to provide
> good way to find new master even if the proposed feature finally fails
> to be adopted.
I agree that they're separate features. My argument is that the quorum
synch feature isn't materially useful if we don't create some feature to
identify which server(s) were in synch at the time the master died.
The main reason I'm arguing on this thread is that discussion of this
feature went straight into GUC syntax, without ever discussing:
* what use cases are we serving?
* what features do those use cases need?
I'm saying that we need to have that discussion first before we go into
syntax. We gave up on quorum commit in 9.1 partly because nobody was
convinced that it was actually useful; that case still needs to be
established, and if we can determine *under what circumstances* it's
useful, then we can know if the proposed feature we have is what we want
or not.
Myself, I have two use cases for changes to sync rep:
1. the ability to specify a group of three replicas in the same data
center, and have commit succeed if it succeeds on two of them. The
purpose of this is to avoid data loss even if we lose the master and one
replica.
2. the ability to specify that synch needs to succeed on two replicas in
two different data centers. The idea here is to be able to ensure
consistency between all data centers.
Speaking of which: how does the proposed patch roll back the commit on
one replica if it fails to get quorum?
On 07/01/2015 07:55 AM, Peter Eisentraut wrote:
> I respect that some people might like this, but I don't really see this
> as an improvement. It's much easier for an administration person or
> program to type out a list of standbys in a text file than having to go
> through these interfaces that are non-idempotent, verbose, and only
> available when the database server is up. The nice thing about a plain
> and simple system is that you can build a complicated system on top of
> it, if desired.
I'm disagreeing that the proposed system is "plain and simple". What we
have now is simple; anything we try to add on top of it is going to be
much less so. Frankly, given the proposed feature, I'm not sure that a
"plain and simple" implementation is *possible*; it's not a simple problem.
On 07/01/2015 07:58 AM, Sawada Masahiko wrote:
> We can have standby servers with the same application_name today; that
> is effectively a group.
> So there are two problems regarding fail-over:
> 1. How can we know which group (set) we should use? (group means
> application_name here)
> 2. And how can we decide which server of that group we should
> promote to the next master server?
Well, one possibility is to have each replica keep a flag which
indicates whether it thinks it's in sync or not. This flag would be
updated every time the replica sends a sync-ack to the master. There's a
couple issues with that though:
Synch Flag: the flag would need to be WAL-logged or written to disk
somehow on the replica, in case of the situation where the whole data
center shuts down, comes back up, and the master fails on restart. In
order for the replica to WAL-log this, we'd need to add special .sync
files to pg_xlog, like we currently have .history. Such a file could be
getting updated thousands of times per second, which is potentially an
issue. We could reduce writes by either synching to disk periodically,
or having the master write the sync state to a catalog, and replicate
it, but ...
Race Condition: there's a bit of a race condition during adverse
shutdown situations which could result in uncertainty, especially in
general data center failures and network failures which might not hit
all servers at the same time. If the master is wal-logging sync state,
this race condition is much worse, because it's pretty much certain that
one message updating sync state would be lost in the event of a master
crash. Likewise, if we don't log every synch state change, we've
widened the opportunity for a race condition.
> #1 is one of the big problems, I think.
> I haven't come up with a correct solution yet, but we would need to know
> which server (group) is the best for promoting
> without the running old master server.
> For example, improving the pg_stat_replication view, or having a mediation
> process that always checks the progress of each standby.
Well, pg_stat_replication is useless for promotion, because if you need
to do an emergency promotion, you don't have access to that view.
Mind you, adding any additional synch configurations will require either
extra columns in pg_stat_replication, or a new system view, but that
doesn't help us for the failover issue.
> #2, I guess the best solution is that the DBA can promote any server
> of the group.
> That is, the DBA can always promote a server without considering the
> state of the other servers in that group.
> It's not difficult: always use the lowest LSN of a group as the group LSN.
Sure, but if we're going to do that, why use synch rep at all? Let
alone quorum commit?
> Sounds convenient and flexible. I agree with this JSON-format
> parameter only if we don't combine both quorum and prioritization,
> because of backward compatibility.
> I tend to prefer a JSON-format value as a new separate GUC parameter.
Well, we could just detect if the parameter begins with { or not. ;-)
We could also do an end-run around the current GUC code by not
permitting line breaks in the JSON.
>> Question: what happens *today* if we have two different synch rep
>> strings in two different *.conf files? I wouldn't assume that anyone
>> has tested this ...
>
> We use the last defined parameter even if sync rep strings appear in
> several files, right?
Yeah, I was just wondering if anyone had tested that.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 05:10:50 |
Message-ID: | CAB7nPqQt89j3rXfiFxmiCgoD72O1Fq1bNJaoe+dQHvPSuhPcEw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Jul 1, 2015 at 11:45 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On 6/26/15 1:46 AM, Michael Paquier wrote:
>> - k(elt1,elt2,eltN) means that we need for the k elements in the set
>> to return true (aka commit confirmation).
>> - k[elt1,elt2,eltN] means that we need for the first k elements in the
>> set to return true.
>
> I think the difference between (...) and [...] is not intuitive. To me,
> {...} would be more intuitive to indicate order does not matter.
When defining a set of elements, {} defines elements one by one, while ()
and [] are used for ranges. Perhaps the difference is clearer this way.
>> When k is not defined for a group, k = 1.
>
> How about putting it at the end? Like
>
> [foo,bar,baz](2)
I am less convinced by that, but I won't argue against it either.
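For illustration, the proposed k(...) / k[...] notation might be parsed and
checked like this. The parser and the exact evaluation semantics are an
assumption sketched from the descriptions upthread ("any k of the set" for
parentheses, "the first k elements" for brackets), not the patch's code:

```python
import re

def parse(s):
    """Parse 'k(a,b,c)' / 'k[a,b,c]' into nested tuples:
    ('quorum'|'priority', k, [children]) or a plain standby name (str)."""
    tokens = re.findall(r'\d+|[()\[\],]|[^\s()\[\],]+', s)
    pos = 0
    def expr():
        nonlocal pos
        k = 1                                  # k defaults to 1 when omitted
        if tokens[pos].isdigit() and pos + 1 < len(tokens) and tokens[pos + 1] in '([':
            k = int(tokens[pos]); pos += 1
        if tokens[pos] in '([':
            kind = 'quorum' if tokens[pos] == '(' else 'priority'
            close = ')' if tokens[pos] == '(' else ']'
            pos += 1
            children = [expr()]
            while tokens[pos] == ',':
                pos += 1
                children.append(expr())
            assert tokens[pos] == close
            pos += 1
            return (kind, k, children)
        name = tokens[pos]; pos += 1
        return name
    return expr()

def satisfied(node, acked):
    """True if the set of ACKed standby names meets the spec."""
    if isinstance(node, str):
        return node in acked
    kind, k, children = node
    if kind == 'quorum':                       # any k of the set
        return sum(satisfied(c, acked) for c in children) >= k
    # priority: the first k listed elements must all have ACKed
    return all(satisfied(c, acked) for c in children[:k])

tree = parse('2(local1, local2, 1[dc2a, dc2b])')
print(satisfied(tree, {'local1', 'dc2a'}))   # True
print(satisfied(tree, {'local1'}))           # False
```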
--
Michael
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 05:29:26 |
Message-ID: | CAB7nPqSeNaCMbnWCL9uEqJHOLRZUn7fpbV_73N1iAnJ69XqJYQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Jul 1, 2015 at 11:58 PM, Sawada Masahiko wrote:
> On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus wrote:
>>
>> BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC
>> (assuming it's one parameter) instead of some custom syntax. If it's
>> JSON, we can validate it in psql, whereas if it's some custom syntax we
>> have to wait for the db to reload and fail to figure out that we forgot
>> a comma. Using JSON would also permit us to use jsonb_set and
>> jsonb_delete to incrementally change the configuration.
>
> Sounds convenient and flexible. I agree with this JSON-format
> parameter only if we don't combine both quorum and prioritization,
> because of backward compatibility.
> I tend to prefer a JSON-format value as a new separate GUC parameter.
This is going to make postgresql.conf unreadable. That does not look
very user-friendly, and a JSON object is actually longer in characters
than the formula spec proposed upthread.
> Anyway, if we use JSON, I'm imagining parameter values like below.
> [JSON]
>> Question: what happens *today* if we have two different synch rep
>> strings in two different *.conf files? I wouldn't assume that anyone
>> has tested this ...
> We use the last defined parameter even if sync rep strings appear in several files, right?
The last one wins, that's the rule in GUCs. Note that
postgresql.auto.conf has the top priority over the rest, and that
files included in postgresql.conf have their value considered when
they are opened by the parser.
Well, the JSON format has merit if stored as metadata in PGDATA, such that
it is independent of WAL, in something like pg_syncdata/, and if it
can be modified with a useful interface, which is where Josh's first
idea could prove to be useful. We just need a clear representation of
the JSON schema we would use, and of the kind of functions with which we
could manipulate it, on top of a get/set that can be used to retrieve and
update the metadata as wanted.
In order to preserve backward compatibility, s_s_names could be set to a
'special_value' to switch to the old interface. We could consider
dropping it after a couple of releases, once we are sure that the new
system is stable.
Also, I think that we should rely on SIGHUP as a first step of the
implementation to update the status of sync nodes in backend
processes. As a future improvement we could perhaps get rid of that. Still,
it seems safer to me to rely on a signal to update the in-memory status
as a first step, as this is what we have now.
--
Michael
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 06:12:15 |
Message-ID: | CAHGQGwGU2DV0K17sHzyfLVAfq_cZm5ijAYGLwY7HkSgyX0brOw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jul 2, 2015 at 3:21 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> All:
>
> Replying to multiple people below.
>
> On 07/01/2015 07:15 AM, Fujii Masao wrote:
>> On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> You're confusing two separate things. The primary manageability problem
>>> has nothing to do with altering the parameter. The main problem is: if
>>> there is more than one synch candidate, how do we determine *after the
>>> master dies* which candidate replica was in synch at the time of
>>> failure? Currently there is no way to do that. This proposal plans to,
>>> effectively, add more synch candidate configurations without addressing
>>> that core design failure *at all*. That's why I say that this patch
>>> decreases overall reliability of the system instead of increasing it.
>>
>> I agree this is a problem even today, but it's basically independent from
>> the proposed feature *itself*. So I think that it's better to discuss and
>> work on the problem separately. If so, we might be able to provide
>> good way to find new master even if the proposed feature finally fails
>> to be adopted.
>
> I agree that they're separate features. My argument is that the quorum
> synch feature isn't materially useful if we don't create some feature to
> identify which server(s) were in synch at the time the master died.
>
> The main reason I'm arguing on this thread is that discussion of this
> feature went straight into GUC syntax, without ever discussing:
>
> * what use cases are we serving?
> * what features do those use cases need?
>
> I'm saying that we need to have that discussion first before we go into
> syntax. We gave up on quorum commit in 9.1 partly because nobody was
> convinced that it was actually useful; that case still needs to be
> established, and if we can determine *under what circumstances* it's
> useful, then we can know if the proposed feature we have is what we want
> or not.
>
> Myself, I have two use cases for changes to sync rep:
>
> 1. the ability to specify a group of three replicas in the same data
> center, and have commit succeed if it succeeds on two of them. The
> purpose of this is to avoid data loss even if we lose the master and one
> replica.
>
> 2. the ability to specify that synch needs to succeed on two replicas in
> two different data centers. The idea here is to be able to ensure
> consistency between all data centers.
Yeah, I'm also thinking about those *simple* use cases. I'm not sure
how many people really want a very complicated quorum
commit setting.
> Speaking of which: how does the proposed patch roll back the commit on
> one replica if it fails to get quorum?
You mean the case where there are two sync replicas, the master
needs to wait until both send the ACK, and then only one replica goes down?
In this case, the master receives the ACK from only one replica and
must keep waiting until a new sync replica appears and sends back
the ACK. So the committed transaction (written WAL record) would not
be rolled back.
> Well, one possibility is to have each replica keep a flag which
> indicates whether it thinks it's in sync or not. This flag would be
> updated every time the replica sends a sync-ack to the master. There's a
> couple issues with that though:
I don't think this is a good approach because there can be a case where
you need to promote even a standby server that does not have the sync flag.
Please imagine the case where you have sync and async standby servers.
When the master goes down, the async standby might be ahead of the
sync one. This is possible in practice. In this case, it might be better to
promote the async standby instead of the sync one, because the remaining
sync standby, which is behind, can easily catch up with the new master.
We could promote the sync standby in this case. But since the remaining
async standby is ahead, it's not easy for it to catch up with the new master.
Probably a new base backup would need to be taken onto the async standby from
the new master, or pg_rewind would need to be executed. That is, the async
standby basically needs to be set up again.
So I'm thinking that we basically need to check the progress on each
standby to choose new master.
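Checking progress to pick the most advanced standby, regardless of its sync
flag, might look like this. The LSN conversion helper and the sample data
are illustrative assumptions, as an external monitor might collect them:

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN like '0/3000148' to an integer for comparison."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

# (name, is_sync, last replayed LSN) for each surviving standby
standbys = [
    ('sync1',  True,  '0/3000060'),
    ('async1', False, '0/3000148'),   # async, but further ahead
]

# choose the most advanced standby, ignoring the sync flag entirely
best = max(standbys, key=lambda s: lsn_to_int(s[2]))
print(best[0])   # async1
```

Here the async standby wins promotion, and the sync standby, being behind,
can simply catch up with the new master, which is exactly the scenario
described above.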
Regards,
--
Fujii Masao
From: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 06:29:09 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2015-07-02 PM 03:12, Fujii Masao wrote:
>
> So I'm thinking that we basically need to check the progress on each
> standby to choose new master.
>
Does HA software determine a standby to promote based on replication progress
or would things be reliable enough for it to infer one from the quorum setting
specified in GUC (or wherever)? Is part of the job of this patch to make the
latter possible? Just wondering or perhaps I am completely missing the point.
Thanks,
Amit
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 06:43:07 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Amit wrote:
> Does HA software determine a standby to promote based on replication
> progress
> or would things be reliable enough for it to infer one from the quorum
> setting
> specified in GUC (or wherever)? Is part of the job of this patch to make
> the
> latter possible? Just wondering or perhaps I am completely missing the
> point.
Deciding the failover standby is not exactly part of this patch but we
should be able to set up a mechanism to decide which is the best standby to
be promoted.
We might not be able to conclude this from the sync parameter alone.
As mentioned before, in some cases an async standby could also be the most
eligible for promotion.
-----
--
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5856201.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 06:52:39 |
Message-ID: | CAB7nPqTVxTL3uTWML93BOEEh=2krG3hkT=R1_Y=7-Jee2WZ4KQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2015-07-02 PM 03:12, Fujii Masao wrote:
>>
>> So I'm thinking that we basically need to check the progress on each
>> standby to choose new master.
>>
>
> Does HA software determine a standby to promote based on replication progress
> or would things be reliable enough for it to infer one from the quorum setting
> specified in GUC (or wherever)? Is part of the job of this patch to make the
> latter possible? Just wondering or perhaps I am completely missing the point.
Replication progress is a factor of choice, but not the only one. The
sole role of this patch is just to allow us to have more advanced
policy in defining how synchronous replication works, aka how we want
to let the master acknowledge a commit synchronously from a set of N
standbys. In any case, this is something unrelated to the discussion
happening here.
--
Michael
From: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 07:12:36 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2015-07-02 PM 03:52, Michael Paquier wrote:
> On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> On 2015-07-02 PM 03:12, Fujii Masao wrote:
>>>
>>> So I'm thinking that we basically need to check the progress on each
>>> standby to choose new master.
>>>
>>
>> Does HA software determine a standby to promote based on replication progress
>> or would things be reliable enough for it to infer one from the quorum setting
>> specified in GUC (or wherever)? Is part of the job of this patch to make the
>> latter possible? Just wondering or perhaps I am completely missing the point.
>
> Replication progress is a factor of choice, but not the only one. The
> sole role of this patch is just to allow us to have more advanced
> policy in defining how synchronous replication works, aka how we want
> to let the master acknowledge a commit synchronously from a set of N
> standbys. In any case, this is something unrelated to the discussion
> happening here.
>
Got it, thanks!
Regards,
Amit
From: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 07:16:30 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2015-07-02 PM 03:43, Beena Emerson wrote:
> Amit wrote:
>
>> Does HA software determine a standby to promote based on replication
>> progress
>> or would things be reliable enough for it to infer one from the quorum
>> setting
>> specified in GUC (or wherever)? Is part of the job of this patch to make
>> the
>> latter possible? Just wondering or perhaps I am completely missing the
>> point.
>
> Deciding the failover standby is not exactly part of this patch but we
> should be able to set up a mechanism to decide which is the best standby to
> be promoted.
>
> We might not be able to conclude this from the sync parameter alone.
>
> As mentioned before, in some cases an async standby could also be the
> most eligible for promotion.
>
Thanks for the explanation.
Regards,
Amit
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 08:44:59 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello,
There has been a lot of discussion. It has become a bit confusing.
I am summarizing my understanding of the discussion till now.
Kindly let me know if I missed anything important.
Backward compatibility:
We have to provide support for the current format and behavior for
synchronous replication (the first running standby from the s_s_names list).
If the new format is not a GUC, then a special value would have to be
specified for s_s_names to indicate that.
Priority and quorum:
Quorum treats all the standbys with the same priority, while in priority
behavior each one has a different priority and an ACK must be received from
the specified k servers with the lowest priority values.
I am not sure how combining both will work out.
Mostly we would like to have some standbys from each data center be in
sync. Can that not be achieved by quorum only?
GUC parameter:
There are some arguments over the text format. However, if we continue using
it, specifying the number before the group is a more readable option than
specifying it after:
s_s_names = 3(A, (P,Q), 2(X,Y,Z)) is better compared to
s_s_names = (A, (P,Q), (X,Y,Z)(2))(3)
Catalog Method:
Is it safe to assume we are not going ahead with the Catalog approach?
A system catalog and some built-in functions to set the sync parameters are
not viable because:
- a promoted master would sync rep to itself
- changes to the catalog may continuously wait for an ACK from a down server
The main problem of an unlogged system catalog is data loss during a crash.
JSON:
I agree it would make the GUC very complex and unreadable. We can consider
using it as metadata.
I think the only point in favor of JSON is being able to set it using
functions instead of having to edit and reload, right?
Identifying standby:
The main concern with the current use of application_name seems to be that
multiple standbys with the same name would unintentionally form a group
(maybe across data clusters too?).
I agree it would be better to have a mechanism to uniquely identify a
standby; groups can then be made using whatever method we use to set the
sync requirements.
The concern about deciding which standby is to be promoted seems to be a
separate issue altogether.
-----
--
Beena Emerson
--
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 12:31:46 |
Message-ID: | CAHGQGwFyW24z5h74MnBNM-djFyk-3XwPr0A3cVvTi4TWbzGSVg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jul 2, 2015 at 5:44 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Hello,
> There has been a lot of discussion. It has become a bit confusing.
> I am summarizing my understanding of the discussion till now.
> Kindly let me know if I missed anything important.
>
> Backward compatibility:
> We have to provide support for the current format and behavior for
> synchronous replication (The first running standby from list s_s_names)
> In case the new format does not include GUC, then a special value to be
> specified for s_s_names to indicate that.
>
> Priority and quorum:
> Quorum treats all the standby with same priority while in priority behavior,
> each one has a different priority and ACK must be received from the
> specified k lowest priority servers.
> I am not sure how combining both will work out.
> Mostly we would like to have some standbys from each data center to be in
> sync. Can it not be achieved by quorum only?
So you're wondering if there is a use case where both quorum and priority
are used together?
For example, please imagine the case where you have two standby servers
(say A and B) in the local site, and one standby server (say C) in a remote
disaster recovery site. You want to set up sync replication so that the
master waits for an ACK from either A or B, i.e., the setting 1(A, B). Also,
only when either A or B crashes, you want to make the master wait for an ACK
from either the remaining local standby or C. On the other hand, you don't
want to use a setting like 1(A, B, C), because in that setting C can be a
sync standby when the master crashes, and both A and B might be far behind
C. In that case, you would need to promote the remote standby server C to
the new master... this is what you'd like to avoid.
The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar.
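Under the semantics sketched upthread, () as quorum and [] as priority over
currently connected members, a quick illustrative check of which ACK sets
satisfy 1(1[A, C], 1[B, C]). The helper functions and their exact semantics
are assumptions for illustration, not patch code:

```python
from itertools import combinations

def prio(k, names, acked, connected):
    """k[...]: ACK needed from the first k *connected* elements in order."""
    live = [n for n in names if n in connected]
    return all(n in acked for n in live[:k])

def setting_ok(acked, connected):
    # 1( 1[A, C], 1[B, C] ): quorum of 1 over the two priority groups
    return prio(1, 'AC', acked, connected) or prio(1, 'BC', acked, connected)

# with all three standbys connected, any ACK set containing A or B suffices
for r in range(1, 4):
    for acked in combinations('ABC', r):
        if setting_ok(set(acked), set('ABC')):
            print(''.join(acked))
# prints: A B AB AC BC ABC -- C alone never suffices while A and B are up
```

Once A drops out of its group, C moves up to first position in 1[A, C], so
an ACK from C then satisfies the setting, matching the fallback behavior
described above.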
Regards,
--
Fujii Masao
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 13:48:50 |
Message-ID: | CANP8+jJ26_BaFTEe2YokzbTNA6GKnwvnCg2BJJ2u+ONN-8_niw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2 July 2015 at 09:44, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> I am not sure how combining both will work out.
>
Use cases needed.
> Catalog Method:
> Is it safe to assume we may not going ahead with the Catalog approach?
>
Yes
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 18:10:27 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 07/01/2015 11:12 PM, Fujii Masao wrote:
> I don't think this is good approach because there can be the case where
> you need to promote even the standby server not having sync flag.
> Please imagine the case where you have sync and async standby servers.
> When the master goes down, the async standby might be ahead of the
> sync one. This is possible in practice. In this case, it might be better to
> promote the async standby instead of sync one. Because the remaining
> sync standby which is behind can easily follow up with new master.
If we're always going to be polling the replicas for furthest ahead,
then why bother implementing quorum synch at all? That's the basic
question I'm asking. What does it buy us that we don't already have?
I'm serious, here. Without any additional information on synch state at
failure time, I would never use quorum synch. If there's someone on
this thread who *would*, let's speak to their use case and then we can
actually get the feature right. Anyone?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 18:31:15 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 2015-07-02 11:10:27 -0700, Josh Berkus wrote:
> If we're always going to be polling the replicas for furthest ahead,
> then why bother implementing quorum synch at all? That's the basic
> question I'm asking. What does it buy us that we don't already have?
What do those topics have to do with each other? A standby can fundamentally
be further ahead than what the primary knows about, so you can't do
very much with that knowledge on the master anyway?
> I'm serious, here. Without any additional information on synch state at
> failure time, I would never use quorum synch. If there's someone on
> this thread who *would*, let's speak to their use case and then we can
> actually get the feature right. Anyone?
How would you otherwise ensure that your data is both on a second server
in the same DC and in another DC? Which is a pretty darn common desire?
Greetings,
Andres Freund
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 18:50:44 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 07/02/2015 11:31 AM, Andres Freund wrote:
> On 2015-07-02 11:10:27 -0700, Josh Berkus wrote:
>> If we're always going to be polling the replicas for furthest ahead,
>> then why bother implementing quorum synch at all? That's the basic
>> question I'm asking. What does it buy us that we don't already have?
>
> What do those topic have to do with each other? A standby fundamentally
> can be further ahead than what the primary knows about. So you can't do
> very much with that knowledge on the master anyway?
>
>> I'm serious, here. Without any additional information on synch state at
>> failure time, I would never use quorum synch. If there's someone on
>> this thread who *would*, let's speak to their use case and then we can
>> actually get the feature right. Anyone?
>
> How would you otherwise ensure that your data is both on a second server
> in the same DC and in another DC? Which is a pretty darn common desire?
So there are two parts to this:
1. I need to ensure that data is replicated to X places.
2. I need to *know* which places data was synchronously replicated to
when the master goes down.
My entire point is that (1) alone is useless unless you also have (2).
And do note that I'm talking about information on the replica, not on
the master, since in any failure situation we don't have the old master
around to check.
Say you take this case:
"2" : { "local_replica", "london_server", "nyc_server" }
... which should ensure that any data which is replicated is replicated
to at least two places, so that even if you lose the entire local
datacenter, you have the data on at least one remote data center.
EXCEPT: say you lose both the local datacenter and communication with
the london server at the same time (due to transatlantic cable issues, a
huge DDOS, or whatever). You'd like to promote the NYC server to be the
new master, but only if it was in sync at the time its communication
with the original master was lost ... except that you have no way of
knowing that.
Given that, we haven't really reduced our data loss potential or
improved availability over the current 1-redundant synch rep. We still
need to wait to get the London server back to figure out whether we want to
promote or not.
Now, this configuration would reduce the data loss window:
"3" : { "local_replica", "london_server", "nyc_server" }
As would this one:
"2" : { "local_replica", "nyc_server" }
... because we would know definitively which servers were in sync. So
maybe that's the use case we should be supporting?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 19:44:58 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
> So there's two parts to this:
>
> 1. I need to ensure that data is replicated to X places.
>
> 2. I need to *know* which places data was synchronously replicated to
> when the master goes down.
>
> My entire point is that (1) alone is useless unless you also have (2).
I think there's a good set of use cases where that's really not the case.
> And do note that I'm talking about information on the replica, not on
> the master, since in any failure situation we don't have the old
> master around to check.
How would you, even theoretically, synchronize that knowledge to all the
replicas? Even when they're temporarily disconnected?
> Say you take this case:
>
> "2" : { "local_replica", "london_server", "nyc_server" }
>
> ... which should ensure that any data which is replicated is replicated
> to at least two places, so that even if you lose the entire local
> datacenter, you have the data on at least one remote data center.
> EXCEPT: say you lose both the local datacenter and communication with
> the london server at the same time (due to transatlantic cable issues, a
> huge DDOS, or whatever). You'd like to promote the NYC server to be the
> new master, but only if it was in sync at the time its communication
> with the original master was lost ... except that you have no way of
> knowing that.
Pick up the phone, compare the LSNs, done.
> Given that, we haven't really reduced our data loss potential or
> improved availabilty from the current 1-redundant synch rep. We still
> need to wait to get the London server back to figure out if we want to
> promote or not.
>
> Now, this configuration would reduce the data loss window:
>
> "3" : { "local_replica", "london_server", "nyc_server" }
>
> As would this one:
>
> "2" : { "local_replica", "nyc_server" }
>
> ... because we would know definitively which servers were in sync. So
> maybe that's the use case we should be supporting?
If you want automated failover you need a leader election amongst the
surviving nodes. The replay position is all they need to elect the node
that's furthest ahead, and that information exists today.
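As a toy sketch of that election (node names and positions are invented; a real setup would gather each node's replay position and need the survivors to agree on the candidate set):

```python
# Toy leader election among surviving standbys: the node whose replay
# position is furthest ahead is promoted. Positions are replay byte
# offsets already converted to integers; the values are made up.

def elect(replay_positions):
    """Return the name of the node with the greatest replay position."""
    return max(replay_positions, key=replay_positions.get)

survivors = {"london_server": 0x3000060, "nyc_server": 0x3000148}
print(elect(survivors))  # nyc_server is furthest ahead
```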
Greetings,
Andres Freund
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-02 21:54:19 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 07/02/2015 12:44 PM, Andres Freund wrote:
> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
>> So there's two parts to this:
>>
>> 1. I need to ensure that data is replicated to X places.
>>
>> 2. I need to *know* which places data was synchronously replicated to
>> when the master goes down.
>>
>> My entire point is that (1) alone is useless unless you also have (2).
>
> I think there's a good set of usecases where that's really not the case.
Please share! My plea for use cases was sincere. I can't think of any.
>> And do note that I'm talking about information on the replica, not on
>> the master, since in any failure situation we don't have the old
>> master around to check.
>
> How would you, even theoretically, synchronize that knowledge to all the
> replicas? Even when they're temporarily disconnected?
You can't, which is why what we need to know is when the replica thinks
it was last synced, from the replica side. That is, a sync timestamp and
LSN from the last time the replica ack'd a sync commit back to the
master successfully. Based on that information, I can make an informed
decision, even if I'm down to one replica.
>> ... because we would know definitively which servers were in sync. So
>> maybe that's the use case we should be supporting?
>
> If you want automated failover you need a leader election amongst the
> surviving nodes. The replay position is all they need to elect the node
> that's furthest ahead, and that information exists today.
I can do that already. If quorum synch commit doesn't help us minimize
data loss any better than async replication or the current 1-redundant,
why would we want it? If it does help us minimize data loss, how?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 03:18:31 |
Message-ID: | CAHGQGwEu5dpGDLMnkOC3wHDH3NP2mc4OMmrwgriaJtZm58pZPQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 07/02/2015 12:44 PM, Andres Freund wrote:
>> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
>>> So there's two parts to this:
>>>
>>> 1. I need to ensure that data is replicated to X places.
>>>
>>> 2. I need to *know* which places data was synchronously replicated to
>>> when the master goes down.
>>>
>>> My entire point is that (1) alone is useless unless you also have (2).
>>
>> I think there's a good set of usecases where that's really not the case.
>
> Please share! My plea for usecases was sincere. I can't think of any.
>
>>> And do note that I'm talking about information on the replica, not on
>>> the master, since in any failure situation we don't have the old
>>> master around to check.
>>
>> How would you, even theoretically, synchronize that knowledge to all the
>> replicas? Even when they're temporarily disconnected?
>
> You can't, which is why what we need to know is when the replica thinks
> it was last synced from the replica side. That is, a sync timestamp and
> lsn from the last time the replica ack'd a sync commit back to the
> master successfully. Based on that information, I can make an informed
> decision, even if I'm down to one replica.
>
>>> ... because we would know definitively which servers were in sync. So
>>> maybe that's the use case we should be supporting?
>>
>> If you want automated failover you need a leader election amongst the
>> surviving nodes. The replay position is all they need to elect the node
>> that's furthest ahead, and that information exists today.
>
> I can do that already. If quorum synch commit doesn't help us minimize
> data loss any better than async replication or the current 1-redundant,
> why would we want it? If it does help us minimize data loss, how?
In your example of "2" : { "local_replica", "london_server", "nyc_server" },
without something like quorum commit, only local_replica is sync
and the other two are async. In this case, if the local data center gets
destroyed, you need to promote either london_server or nyc_server. But
since they are async, they might not have the data which has already been
committed on the master. So data loss! Of course, as I said yesterday,
they might have all the data, and then no data loss happens at the promotion.
But the point is that there is no guarantee that no data loss happens.
OTOH, if we use quorum commit, we can guarantee that either london_server
or nyc_server has all the data which has been committed on the master.
So I think that quorum commit is helpful for minimizing the data loss.
Regards,
--
Fujii Masao
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 04:53:43 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Josh Berkus wrote:
>
> Say you take this case:
>
> "2" : { "local_replica", "london_server", "nyc_server" }
>
> ... which should ensure that any data which is replicated is replicated
> to at least two places, so that even if you lose the entire local
> datacenter, you have the data on at least one remote data center.
> EXCEPT: say you lose both the local datacenter and communication with
> the london server at the same time (due to transatlantic cable issues, a
> huge DDOS, or whatever). You'd like to promote the NYC server to be the
> new master, but only if it was in sync at the time its communication
> with the original master was lost ... except that you have no way of
> knowing that.
Please consider the following: if we have multiple replicas in each DC, we can
use:
3(local1, 1(london1, london2), 1(nyc1, nyc2))
In this case at least one standby from each DC is a sync standby. When the
local and London data centers are lost, NYC promotion can be done by comparing
the LSNs.
Quorum would also ensure that even if one of the standbys in a data
center goes down, another can take over, preventing data loss.
In the case 3(local1, london1, nyc1), if nyc1 is down, the transaction would
wait indefinitely. This can be avoided.
-----
--
Beena Emerson
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 07:34:13 |
Message-ID: | CAOG9ApF7dEvRcwm0tJMhBbWsXG87V-8a-HuSWJaNvTviTOa8Fg@mail.gmail.com |
Lists: | pgsql-hackers |
Hello,
This has been registered in the next 2015-09 CF since the majority are in favor
of adding this multiple synchronous replication feature (with quorum/priority).
A new patch will be submitted once we have reached a consensus on the design.
--
Beena Emerson
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 08:59:03 |
Message-ID: | CAD21AoAVj7EypB1dG7LECzsyDV4+nZWPJbiyaTGZhdMWhr1EAw@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jul 3, 2015 at 12:18 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 07/02/2015 12:44 PM, Andres Freund wrote:
>>> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
>>>> So there's two parts to this:
>>>>
>>>> 1. I need to ensure that data is replicated to X places.
>>>>
>>>> 2. I need to *know* which places data was synchronously replicated to
>>>> when the master goes down.
>>>>
>>>> My entire point is that (1) alone is useless unless you also have (2).
>>>
>>> I think there's a good set of usecases where that's really not the case.
>>
>> Please share! My plea for usecases was sincere. I can't think of any.
>>
>>>> And do note that I'm talking about information on the replica, not on
>>>> the master, since in any failure situation we don't have the old
>>>> master around to check.
>>>
>>> How would you, even theoretically, synchronize that knowledge to all the
>>> replicas? Even when they're temporarily disconnected?
>>
>> You can't, which is why what we need to know is when the replica thinks
>> it was last synced from the replica side. That is, a sync timestamp and
>> lsn from the last time the replica ack'd a sync commit back to the
>> master successfully. Based on that information, I can make an informed
>> decision, even if I'm down to one replica.
>>
>>>> ... because we would know definitively which servers were in sync. So
>>>> maybe that's the use case we should be supporting?
>>>
>>> If you want automated failover you need a leader election amongst the
>>> surviving nodes. The replay position is all they need to elect the node
>>> that's furthest ahead, and that information exists today.
>>
>> I can do that already. If quorum synch commit doesn't help us minimize
>> data loss any better than async replication or the current 1-redundant,
>> why would we want it? If it does help us minimize data loss, how?
>
> In your example of "2" : { "local_replica", "london_server", "nyc_server" },
> if there is not something like quorum commit, only local_replica is synch
> and the other two are async. In this case, if the local data center gets
> destroyed, you need to promote either london_server or nyc_server. But
> since they are async, they might not have the data which have been already
> committed in the master. So data loss! Of course, as I said yesterday,
> they might have all the data and no data loss happens at the promotion.
> But the point is that there is no guarantee that no data loss happens.
> OTOH, if we use quorum commit, we can guarantee that either london_server
> or nyc_server has all the data which have been committed in the master.
>
> So I think that quorum commit is helpful for minimizing the data loss.
>
Yeah, quorum commit is helpful for minimizing data loss in comparison
with today's replication.
But in this case, how can we know which server we should use as
the next master after the local data center goes down?
If we choose the wrong one, we could still get data loss.
Regards,
--
Sawada Masahiko
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 09:23:20 |
Message-ID: | CAHGQGwGzv7BHUSYO692ifxXxYrzEkaamO6DfXSBieEGtro_QYw@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jul 3, 2015 at 5:59 PM, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Fri, Jul 3, 2015 at 12:18 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> On 07/02/2015 12:44 PM, Andres Freund wrote:
>>>> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
>>>>> So there's two parts to this:
>>>>>
>>>>> 1. I need to ensure that data is replicated to X places.
>>>>>
>>>>> 2. I need to *know* which places data was synchronously replicated to
>>>>> when the master goes down.
>>>>>
>>>>> My entire point is that (1) alone is useless unless you also have (2).
>>>>
>>>> I think there's a good set of usecases where that's really not the case.
>>>
>>> Please share! My plea for usecases was sincere. I can't think of any.
>>>
>>>>> And do note that I'm talking about information on the replica, not on
>>>>> the master, since in any failure situation we don't have the old
>>>>> master around to check.
>>>>
>>>> How would you, even theoretically, synchronize that knowledge to all the
>>>> replicas? Even when they're temporarily disconnected?
>>>
>>> You can't, which is why what we need to know is when the replica thinks
>>> it was last synced from the replica side. That is, a sync timestamp and
>>> lsn from the last time the replica ack'd a sync commit back to the
>>> master successfully. Based on that information, I can make an informed
>>> decision, even if I'm down to one replica.
>>>
>>>>> ... because we would know definitively which servers were in sync. So
>>>>> maybe that's the use case we should be supporting?
>>>>
>>>> If you want automated failover you need a leader election amongst the
>>>> surviving nodes. The replay position is all they need to elect the node
>>>> that's furthest ahead, and that information exists today.
>>>
>>> I can do that already. If quorum synch commit doesn't help us minimize
>>> data loss any better than async replication or the current 1-redundant,
>>> why would we want it? If it does help us minimize data loss, how?
>>
>> In your example of "2" : { "local_replica", "london_server", "nyc_server" },
>> if there is not something like quorum commit, only local_replica is synch
>> and the other two are async. In this case, if the local data center gets
>> destroyed, you need to promote either london_server or nyc_server. But
>> since they are async, they might not have the data which have been already
>> committed in the master. So data loss! Of course, as I said yesterday,
>> they might have all the data and no data loss happens at the promotion.
>> But the point is that there is no guarantee that no data loss happens.
>> OTOH, if we use quorum commit, we can guarantee that either london_server
>> or nyc_server has all the data which have been committed in the master.
>>
>> So I think that quorum commit is helpful for minimizing the data loss.
>>
>
> Yeah, quorum commit is helpful for minimizing data loss in comparison
> with today replication.
> But in this your case, how can we know which server we should use as
> the next master server, after local data center got down?
> If we choose a wrong one, we would get the data loss.
Check the progress of each server, e.g., by using
pg_last_xlog_replay_location(),
and choose the server which is furthest ahead as the new master.
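pg_last_xlog_replay_location() returns the LSN as text in "X/Y" form (two hexadecimal halves). To compare progress across servers you can convert it to a single integer; a minimal sketch (server names and values are illustrative):

```python
# Convert an LSN in PostgreSQL's "X/Y" text form (high 32 bits / low
# 32 bits, both hex) into one integer so replay positions can be
# compared numerically across servers.

def lsn_to_int(lsn):
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Hypothetical replay locations reported by the surviving standbys.
positions = {"london_server": "0/3000060", "nyc_server": "0/3000148"}
furthest = max(positions, key=lambda name: lsn_to_int(positions[name]))
print(furthest)  # nyc_server
```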
Regards,
--
Fujii Masao
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 10:12:46 |
Message-ID: | CAD21AoAKQk5__+_xGq3fMFvNK6YN5RmGCKJ6ubive=ZcF1NUBw@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jul 3, 2015 at 6:23 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Jul 3, 2015 at 5:59 PM, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> Yeah, quorum commit is helpful for minimizing data loss in comparison
>> with today replication.
>> But in this your case, how can we know which server we should use as
>> the next master server, after local data center got down?
>> If we choose a wrong one, we would get the data loss.
>
> Check the progress of each server, e.g., by using
> pg_last_xlog_replay_location(),
> and choose the server which is ahead of as new master.
>
Thanks. So we can choose the next master by checking the
progress of each server, if hot standby is enabled.
And such a procedure is needed even with today's replication.
I think that the #2 problem which Josh pointed out seems to be solved:
1. I need to ensure that data is replicated to X places.
2. I need to *know* which places data was synchronously replicated
to when the master goes down.
And we can address the #1 problem using quorum commit.
Thoughts?
Regards,
--
Sawada Masahiko
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 11:29:46 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Sawada Masahiko wrote:
>
> I think that the #2 problem which is Josh pointed out seems to be solved;
> 1. I need to ensure that data is replicated to X places.
> 2. I need to *know* which places data was synchronously replicated
> to when the master goes down.
> And we can address #1 problem using quorum commit.
>
> Thought?
I agree. The knowledge of which servers were in sync (#2) would not actually
help us determine the new master, and quorum solves #1.
-----
Beena Emerson
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 11:40:04 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 2015-07-02 14:54:19 -0700, Josh Berkus wrote:
> On 07/02/2015 12:44 PM, Andres Freund wrote:
> > On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
> >> So there's two parts to this:
> >>
> >> 1. I need to ensure that data is replicated to X places.
> >>
> >> 2. I need to *know* which places data was synchronously replicated to
> >> when the master goes down.
> >>
> >> My entire point is that (1) alone is useless unless you also have (2).
> >
> > I think there's a good set of usecases where that's really not the case.
>
> Please share! My plea for usecases was sincere. I can't think of any.
"I have important data. I want to survive both a local hardware failure
(it's faster to continue using the local standby) and I want to protect
myself against actual disaster striking the primary datacenter". Pretty
common.
> >> And do note that I'm talking about information on the replica, not on
> >> the master, since in any failure situation we don't have the old
> >> master around to check.
> >
> > How would you, even theoretically, synchronize that knowledge to all the
> > replicas? Even when they're temporarily disconnected?
>
> You can't, which is why what we need to know is when the replica thinks
> it was last synced from the replica side. That is, a sync timestamp and
> lsn from the last time the replica ack'd a sync commit back to the
> master successfully. Based on that information, I can make an informed
> decision, even if I'm down to one replica.
I think you're mashing together nearly unrelated topics.
Note that we already have the last replayed lsn, and we have the
timestamp of the last replayed transaction.
> > If you want automated failover you need a leader election amongst the
> > surviving nodes. The replay position is all they need to elect the node
> > that's furthest ahead, and that information exists today.
>
> I can do that already. If quorum synch commit doesn't help us minimize
> data loss any better than async replication or the current 1-redundant,
> why would we want it? If it does help us minimize data loss, how?
But it does make us safer against data loss? If your app gets back the
commit you know that the data has made it both to the local replica and
to one other datacenter. And you're now safe against the loss of the
master's hardware (the most likely scenario) and against the loss of
the entire primary datacenter. That you need additional logic to know
which other datacenter to fail over to is just yet another piece (which
you *can* build today).
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 17:27:05 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 07/03/2015 03:12 AM, Sawada Masahiko wrote:
> Thanks. So we can choose the next master server by checking the
> progress of each server, if hot standby is enabled.
> And such a procedure is needed even with today's replication.
>
> I think that the #2 problem which Josh pointed out seems to be solved;
> 1. I need to ensure that data is replicated to X places.
> 2. I need to *know* which places data was synchronously replicated
> to when the master goes down.
> And we can address #1 problem using quorum commit.
It's not solved. I still have zero ways of knowing if a replica was in
sync or not at the time the master went down.
Now, you and others have argued persuasively that there are valuable use
cases for quorum commit even without solving that particular issue, but
there's a big difference between "we can work around this problem" and
"the problem is solved". I forked the subject line because I think that
the inability to identify synch replicas under failover conditions is a
serious problem with synch rep *today*, and pretending that it doesn't
exist doesn't help us even if we don't fix it in 9.6.
Let me give you three cases where our lack of information on the replica
side about whether it thinks it's in sync or not causes synch rep to
fail to protect data. The first case is one I've actually seen in
production, and the other two are hypothetical but entirely plausible.
Case #1: two synchronous replica servers have the application name
"synchreplica". An admin uses the wrong Chef template, and deploys a
server which was supposed to be an async replica with the same
recovery.conf template, and it ends up in the "synchreplica" group as
well. Due to restarts (pushing out an update release), the new server
ends up seizing and keeping sync. Then the master dies. Because the new
server wasn't supposed to be a sync replica in the first place, it is
not checked; they just fail over to the furthest ahead of the two
original synch replicas, neither of which was actually in synch.
Case #2: "2 { local, london, nyc }" setup. At 2am, the links between
data centers become unreliable, such that the on-call sysadmin disables
synch rep because commits on the master are intolerably slow. Then, at
10am, the links between data centers fail entirely. The day shift, not
knowing that the night shift disabled sync, fail over to London thinking
that they can do so with zero data loss.
Case #3 "1 { london, frankfurt }, 1 { sydney, tokyo }" multi-group
priority setup. We lose communication with everything but Europe. How
can we decide whether to wait to get sydney back, or to promote London
immediately?
I could come up with numerous other situations, but all of the three
above completely reasonable cases show how having the knowledge of what
time a replica thought it was last in sync is vital to preventing bad
failovers and data loss, and to knowing the quantity of data loss when
it can't be prevented.
It's an issue *now* that the only data we have about the state of sync
rep is on the master, and dies with the master. And it severely limits
the actual utility of our synch rep. People implement synch rep in the
first place because the "best effort" of asynch rep isn't good enough
for them, and yet when it comes to failover we're just telling them
"give it your best effort".
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-03 17:44:22 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 2015-07-03 10:27:05 -0700, Josh Berkus wrote:
> On 07/03/2015 03:12 AM, Sawada Masahiko wrote:
> > Thanks. So we can choose the next master server by checking the
> > progress of each server, if hot standby is enabled.
> > And such a procedure is needed even with today's replication.
> >
> > I think that the #2 problem which Josh pointed out seems to be solved;
> > 1. I need to ensure that data is replicated to X places.
> > 2. I need to *know* which places data was synchronously replicated
> > to when the master goes down.
> > And we can address #1 problem using quorum commit.
>
> It's not solved. I still have zero ways of knowing if a replica was in
> sync or not at the time the master went down.
What?
You pick the standby that's furthest ahead. And you use a high enough
quorum so that given your tolerance for failures you'll always be able
to reach at least one of the synchronous replicas. Then you promote the
one with the highest LSN. Done.
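The election described here can be sketched mechanically: compare the text-form replay LSNs of the surviving standbys and promote the holder of the highest one. A minimal illustration (the node names and LSN values are hypothetical; in a real failover each value would come from querying that standby):

```python
def lsn_to_int(lsn):
    """Convert a text-form LSN like '1F/28001A30' into a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def pick_promotion_target(replay_positions):
    """Given {node_name: replay_lsn_text}, return the furthest-ahead node."""
    return max(replay_positions, key=lambda n: lsn_to_int(replay_positions[n]))

# Hypothetical survey of the surviving standbys after the master is lost.
survivors = {"local": "2/9A000028", "london": "2/9A0000F0", "nyc": "2/89FFD010"}
print(pick_promotion_target(survivors))  # london (highest replay LSN)
```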
This is something that gets *easier* by quorum, not harder.
> I forked the subject line because I think that the inability to
> identify synch replicas under failover conditions is a serious problem
> with synch rep *today*, and pretending that it doesn't exist doesn't
> help us even if we don't fix it in 9.6.
That's just not how failovers can sanely work. And again, you *have* the
information you can have on the standbys already. You *know* what/from
when the last replayed xact is.
> Let me give you three cases where our lack of information on the replica
> side about whether it thinks it's in sync or not causes synch rep to
> fail to protect data. The first case is one I've actually seen in
> production, and the other two are hypothetical but entirely plausible.
>
> Case #1: two synchronous replica servers have the application name
> "synchreplica". An admin uses the wrong Chef template, and deploys a
> server which was supposed to be an async replica with the same
> recovery.conf template, and it ends up in the "synchreplica" group as
> well. Due to restarts (pushing out an update release), the new server
> ends up seizing and keeping sync. Then the master dies. Because the new
> server wasn't supposed to be a sync replica in the first place, it is
> not checked; they just fail over to the furthest ahead of the two
> original synch replicas, neither of which was actually in synch.
Nobody can protect you against such configuration errors. We can make it
harder to misconfigure, sure, but it doesn't have anything to do with
the topic at hand.
> Case #2: "2 { local, london, nyc }" setup. At 2am, the links between
> data centers become unreliable, such that the on-call sysadmin disables
> synch rep because commits on the master are intolerably slow. Then, at
> 10am, the links between data centers fail entirely. The day shift, not
> knowing that the night shift disabled sync, fail over to London thinking
> that they can do so with zero data loss.
As I said earlier, you can check against that today by checking the last
replayed timestamp. SELECT pg_last_xact_replay_timestamp();
You don't have to pick the one that used to be a sync replica. You pick
the one with the most data received.
If the day shift doesn't bother to check the standbys now, they'd not
check either if they had some way to check whether a node was the chosen
sync replica.
> Case #3 "1 { london, frankfurt }, 1 { sydney, tokyo }" multi-group
> priority setup. We lose communication with everything but Europe. How
> can we decide whether to wait to get sydney back, or to promote London
> immediately?
You normally don't continue automatically at all in that situation. To
avoid/minimize data loss you want to have a majority election system to
select the new primary. That requires reaching the majority of the
nodes. This isn't something specific to postgres, if you look at any
solution out there, they're also doing it that way.
Statically choosing which of the replicas in a group is the current sync
one is a *bad* idea. You want to ensure that at least one node in a group
has received the data, and stop waiting as soon as that's the case.
> It's an issue *now* that the only data we have about the state of sync
> rep is on the master, and dies with the master. And it severely limits
> the actual utility of our synch rep. People implement synch rep in the
> first place because the "best effort" of asynch rep isn't good enough
> for them, and yet when it comes to failover we're just telling them
> "give it your best effort".
We don't tell them that, but apparently you do.
This subthread is getting absurd, stopping here.
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Synch failover WAS: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-04 01:16:58 |
Message-ID: | CAB7nPqRjMqRzj6jBZDa+PJSQE=TBR3WunRsW_FFq5Gk_Gn4+gA@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, Jul 4, 2015 at 2:44 AM, Andres Freund wrote:
> This subthread is getting absurd, stopping here.
Yeah, I agree with Andres here, we are making a mountain of nothing
(Frenglish?). I'll send to the other thread some additional ideas soon
using a JSON structure.
--
Michael
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-06 17:03:13 |
Message-ID: | CAD21AoBMTq+a7+omtSiWPsM2Fwk=6rWdSz-roALd7a4QyG7b8g@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Jul 2, 2015 at 9:31 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Thu, Jul 2, 2015 at 5:44 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>> Hello,
>> There has been a lot of discussion. It has become a bit confusing.
>> I am summarizing my understanding of the discussion till now.
>> Kindly let me know if I missed anything important.
>>
>> Backward compatibility:
>> We have to provide support for the current format and behavior for
>> synchronous replication (The first running standby from list s_s_names)
>> In case the new format does not include GUC, then a special value to be
>> specified for s_s_names to indicate that.
>>
>> Priority and quorum:
>> Quorum treats all the standby with same priority while in priority behavior,
>> each one has a different priority and ACK must be received from the
>> specified k lowest priority servers.
>> I am not sure how combining both will work out.
>> Mostly we would like to have some standbys from each data center to be in
>> sync. Can it not be achieved by quorum only?
>
> So you're wondering if there is the use case where both quorum and priority are
> used together?
>
> For example, please imagine the case where you have two standby servers
> (say A and B) in local site, and one standby server (say C) in remote disaster
> recovery site. You want to set up sync replication so that the master waits for
> ACK from either A or B, i.e., the setting of 1(A, B). Also only when either A
> or B crashes, you want to make the master wait for ACK from either the
> remaining local standby or C. On the other hand, you don't want to use the
> setting like 1(A, B, C). Because in this setting, C can be sync standby when
> the master crashes, and both A and B might be far behind C. In this case,
> you need to promote the remote standby server C to the new master... this is what
> you'd like to avoid.
>
> The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar.
>
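Whatever the final grammar, a setting like 1(1[A, C], 1[B, C]) can be evaluated recursively once parsed. A rough sketch that only counts ACKs (the nested-tuple encoding is purely illustrative, and it deliberately ignores the quorum-vs-priority distinction between the two bracket styles):

```python
def satisfied(req, acked):
    """req is a standby name (leaf) or a tuple (k, [children]); returns True
    when at least k children are satisfied by the set of ACKing standbys."""
    if isinstance(req, str):
        return req in acked
    k, children = req
    return sum(satisfied(c, acked) for c in children) >= k

# Fujii-san's 1(1[A, C], 1[B, C]) as a nested tuple.
setting = (1, [(1, ["A", "C"]), (1, ["B", "C"])])
print(satisfied(setting, {"A"}))   # True: A alone covers the first subgroup
print(satisfied(setting, {"C"}))   # True: C appears in both subgroups
print(satisfied(setting, set()))   # False
```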
If we set the remote disaster recovery site up as a synch replica, we
would get some big latencies even though we use quorum commit.
So I think this case Fujii-san suggested is a good configuration, and
many users would want to use it.
I tend to agree with combining quorum and prioritization into one GUC
parameter while keeping backward compatibility.
Regards,
--
Sawada Masahiko
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-06 17:56:07 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 07/06/2015 10:03 AM, Sawada Masahiko wrote:
>> > The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar.
>> >
> If we set the remote disaster recovery site up as a synch replica, we
> would get some big latencies even though we use quorum commit.
> So I think this case Fujii-san suggested is a good configuration, and
> many users would want to use it.
> I tend to agree with combining quorum and prioritization into one GUC
> parameter while keeping backward compatibility.
OK, so here's the arguments pro-JSON and anti-JSON:
pro-JSON:
* standard syntax which is recognizable to sysadmins and devops.
* can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
additions/deletions from the synch rep config.
* can add group labels (see below)
anti-JSON:
* more verbose
* syntax is not backwards-compatible, we'd need a switch
* people will want to use line breaks, which we can't support
Re: group labels: I see a lot of value in being able to add names to
quorum groups. Think about how this will be represented in system
views; it will be difficult to show sync status of any quorum group in
any meaningful way if the group has no label, and any system-assigned
label would change unpredictably from the user's perspective.
To give a JSON example, let's take the case of needing to sync to two of
the servers in either London or NC:
'{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [
"london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1,
"servers" : [ "nc1", "nc2" ] } } }'
This says: as the "remotes" group, synch with a quorum of 2 servers in
london and a quorum of 1 server in NC. This assumes for
backwards-compatibility reasons that we support a priority list of
groups of quorums, and not some other combination (see below for more on
this).
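Under that structure, checking which quorum groups a given set of ACKs satisfies is straightforward. A sketch assuming a well-formed version of the JSON above (each group label mapping to a {quorum, servers} object; the helper name is made up):

```python
import json

config = json.loads("""
{ "remotes" : { "london_servers" : { "quorum" : 2,
                  "servers" : [ "london1", "london2", "london3" ] },
                "nc_servers"     : { "quorum" : 1,
                  "servers" : [ "nc1", "nc2" ] } } }
""")

def groups_met(cfg, acked):
    """Return {group_label: bool} for every quorum group in the config."""
    result = {}
    for group in cfg.values():                 # e.g. the "remotes" entry
        for label, spec in group.items():
            n = sum(1 for s in spec["servers"] if s in acked)
            result[label] = n >= spec["quorum"]
    return result

print(groups_met(config, {"london1", "london3", "nc2"}))
# {'london_servers': True, 'nc_servers': True}
```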
The advantage of having these labels is that it becomes easy to
represent statuses for them:
sync_group state definition
remotes waiting { "london_servers" : { "quorum" ...
london_servers synced { "quorum" : 2, "servers" : ...
nc_servers waiting { "quorum" : 1, "servers" [ ...
Without labels, we force the DBA to track groups by raw definitions,
which would be difficult. Also, there's the question of what we do on
reload with any statuses of synch groups which are currently in-process,
if we don't have a stable key with which to identify groups.
The other grammar issue has to do with the nesting nature of quorums and
priorities. A theoretical user could want:
* a priority list of quorum groups
* a quorum group of priority lists
* a quorum group of quorum groups
* a priority list of quorum groups of quorum groups
* a quorum group of quorum groups of priority lists
... etc.
I don't really see any possible end to the possible permutations, which
is why it would be good to establish some real use cases now in
order to figure out what we really want to support. Absent that, my
inclination is that we should implement the simplest possible thing
(i.e. no nesting) for 9.5.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-07 00:42:13 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 2015-07-07 AM 02:56, Josh Berkus wrote:
>
> Re: group labels: I see a lot of value in being able to add names to
> quorum groups. Think about how this will be represented in system
> views; it will be difficult to show sync status of any quorum group in
> any meaningful way if the group has no label, and any system-assigned
> label would change unpredictably from the user's perspective.
>
> To give a JSON example, let's take the case of needing to sync to two of
> the servers in either London or NC:
>
> '{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [
> "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1,
> "servers" : [ "nc1", "nc2" ] } } }'
>
What if we write the above as:
remotes-1 (london_servers-2 [london1, london2, london3], nc_servers-1 [nc1, nc2])
That requires only slightly altering the proposed format, that is, prepending the
sync group label to the quorum number. The monitoring view can be made to
internally generate JSON output (if needed) from it. It does not seem very
ALTER SYSTEM SET friendly, but there are trade-offs either way.
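Such a compact format also parses with very little code. A rough sketch (not from any patch; the `tokenize`/`parse` helpers and the token grammar are guesses at the intended syntax, treating both ( ) and [ ] as delimiting a group's members):

```python
import re

def tokenize(s):
    """Split the compact spec into 'label-k' tokens, bare names, and punctuation."""
    return re.findall(r"[A-Za-z_]\w*-\d+|[A-Za-z_]\w*|[()\[\],]", s)

def parse(tokens):
    """Parse one element: a 'label-k' group followed by ( ) or [ ] members,
    or a bare standby name."""
    tok = tokens.pop(0)
    m = re.fullmatch(r"([A-Za-z_]\w*)-(\d+)", tok)
    if m and tokens and tokens[0] in "([":
        close = ")" if tokens.pop(0) == "(" else "]"
        members = [parse(tokens)]
        while tokens[0] == ",":
            tokens.pop(0)
            members.append(parse(tokens))
        assert tokens.pop(0) == close, "unbalanced delimiters"
        return {"label": m.group(1), "quorum": int(m.group(2)), "members": members}
    return tok

spec = ("remotes-1 (london_servers-2 [london1, london2, london3], "
        "nc_servers-1 [nc1, nc2])")
tree = parse(tokenize(spec))
print(tree["label"], tree["quorum"], [g["label"] for g in tree["members"]])
# remotes 1 ['london_servers', 'nc_servers']
```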
Just my 2c.
Thanks,
Amit
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-07 01:40:17 |
Message-ID: | CAB7nPqTNP6vFwEMj5ktZtj6ZaYnQ=wOfTHp-2gYWjjW19ZFuZg@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> pro-JSON:
>
> * standard syntax which is recognizable to sysadmins and devops.
> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
> additions/deletions from the synch rep config.
> * can add group labels (see below)
If we go this way, I think that managing a JSON blob with a GUC
parameter is crazy; it is way longer in character size than a simple
formula because of the key names. Hence, this JSON blob should live in
a separate place from postgresql.conf, not within the catalog tables,
be manageable using an SQL interface, and be reloaded in backends using
SIGHUP.
> anti-JSON:
> * more verbose
> * syntax is not backwards-compatible, we'd need a switch
This point is valid as well in the pro-JSON portion.
> * people will want to use line breaks, which we can't support
Yes, this is caused by the fact of using a GUC. For a simple formula
this seems fine to me though, that's what we have today for s_s_names
and using a formula is not much longer in character size than what we
have now.
> Re: group labels: I see a lot of value in being able to add names to
> quorum groups. Think about how this will be represented in system
> views; it will be difficult to show sync status of any quorum group in
> any meaningful way if the group has no label, and any system-assigned
> label would change unpredictably from the user's perspective.
> To give a JSON example, let's take the case of needing to sync to two of
> the servers in either London or NC:
>
> '{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [
> "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1,
> "servers" : [ "nc1", "nc2" ] } } }'
The JSON blob managing sync node information could contain additional
JSON objects that register a set of nodes as a given group. More
easily, you could use, let's say, the following structure to store the
blobs:
- pg_syncinfo/global, to store the root of the formula, which could use groups.
- pg_syncinfo/groups/$GROUP_NAME to store a set of JSON blobs representing a group.
> The advantage of having these labels is that it becomes easy to
> represent statuses for them:
>
> sync_group state definition
> remotes waiting { "london_servers" : { "quorum" ...
> london_servers synced { "quorum" : 2, "servers" : ...
> nc_servers waiting { "quorum" : 1, "servers" [ ...
> Without labels, we force the DBA to track groups by raw definitions,
> which would be difficult. Also, there's the question of what we do on
> reload with any statuses of synch groups which are currently in-process,
> if we don't have a stable key with which to identify groups.
Well, yes.
> The other grammar issue has to do with the nesting nature of quorums and
> priorities. A theoretical user could want:
>
> * a priority list of quorum groups
> * a quorum group of priority lists
> * a quorum group of quorum groups
> * a priority list of quorum groups of quorum groups
> * a quorum group of quorum groups of priority lists
> ... etc.
>
> I don't really see any possible end to the possible permutations, which
> is why it would be good to establish some real use cases from now in
> order to figure out what we really want to support. Absent that, my
> inclination is that we should implement the simplest possible thing
> (i.e. no nesting) for 9.5.
I am not sure I agree that this will simplify the work. Currently
s_s_names already has 1 level, and we want to append groups to each
element of it as well, meaning that we'll need at least 2 levels of
nesting.
--
Michael
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-07 03:51:32 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 07/06/2015 06:40 PM, Michael Paquier wrote:
> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> pro-JSON:
>>
>> * standard syntax which is recognizable to sysadmins and devops.
>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>> additions/deletions from the synch rep config.
>> * can add group labels (see below)
>
> If we go this way, I think that managing a JSON blob with a GUC
> parameter is crazy; it is way longer in character size than a simple
> formula because of the key names. Hence, this JSON blob should live in
> a separate place from postgresql.conf, not within the catalog tables,
> be manageable using an SQL interface, and be reloaded in backends using
> SIGHUP.
I'm not following this at all. What are you saying here?
>> I don't really see any possible end to the possible permutations, which
>> is why it would be good to establish some real use cases from now in
>> order to figure out what we really want to support. Absent that, my
>> inclination is that we should implement the simplest possible thing
>> (i.e. no nesting) for 9.5.
>
> I am not sure I agree that this will simplify the work. Currently
> s_s_names already has 1 level, and we want to append groups to each
> element of it as well, meaning that we'll need at least 2 levels of
> nesting.
Well, we have to draw a line somewhere, unless we're going to support
infinite recursion.
And if we are going to support infinite recursion, any kind of compact
syntax for a GUC isn't even worth talking about ...
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-07 04:56:36 |
Message-ID: | CAB7nPqSSxX9hARyDKA0OQ4LhKrdW69dqcpLi84o0pguiVj2VMQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 07/06/2015 06:40 PM, Michael Paquier wrote:
>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> pro-JSON:
>>>
>>> * standard syntax which is recognizable to sysadmins and devops.
>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>>> additions/deletions from the synch rep config.
>>> * can add group labels (see below)
>>
>> If we go this way, I think that managing a JSON blob with a GUC
>> parameter is crazy; it is way longer in character size than a simple
>> formula because of the key names. Hence, this JSON blob should live in
>> a separate place from postgresql.conf, not within the catalog tables,
>> be manageable using an SQL interface, and be reloaded in backends using
>> SIGHUP.
>
> I'm not following this at all. What are you saying here?
A JSON string is longer in terms of number of characters than a
formula because it contains key names, and those key names are usually
repeated several times, making it harder to read in a configuration
file. So what I am saying is that we do not save it as a GUC, but as
a separate metadata that can be accessed with a set of SQL functions
to manipulate it.
--
Michael
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-07 05:19:28 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 07/06/2015 09:56 PM, Michael Paquier wrote:
> On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 07/06/2015 06:40 PM, Michael Paquier wrote:
>>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>>> pro-JSON:
>>>>
>>>> * standard syntax which is recognizable to sysadmins and devops.
>>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>>>> additions/deletions from the synch rep config.
>>>> * can add group labels (see below)
>>>
>>> If we go this way, I think that managing a JSON blob with a GUC
>>> parameter is crazy; it is way longer in character size than a simple
>>> formula because of the key names. Hence, this JSON blob should live in
>>> a separate place from postgresql.conf, not within the catalog tables,
>>> be manageable using an SQL interface, and be reloaded in backends using
>>> SIGHUP.
>>
>> I'm not following this at all. What are you saying here?
>
> A JSON string is longer in terms of number of characters than a
> formula because it contains key names, and those key names are usually
> repeated several times, making it harder to read in a configuration
> file. So what I am saying is that we do not save it as a GUC, but as
> a separate metadata that can be accessed with a set of SQL functions
> to manipulate it.
Where, though? Someone already pointed out the issues with storing it
in a system catalog, and adding an additional .conf file with a
different format is too horrible to contemplate.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-07 05:32:57 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Josh Berkus wrote:
> '{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [
> "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1,
> "servers" : [ "nc1", "nc2" ] } } }'
>
> This says: as the "remotes" group, synch with a quorum of 2 servers in
> london and a quorum of 1 server in NC.
I wanted to clarify about the format.
The remotes group does not specify any quorum; only its individual elements
mention the quorum.
"remotes" is said to sync in london_servers "and" NC.
Would the absence of a quorum number in a group mean "all" elements?
Or would the above be represented as follows, to imply "AND" between the 2
DCs:
'{ "remotes" :
    { "quorum" : 2,
      "servers" :
        { "london_servers" :
            { "quorum" : 2, "servers" : [ "london1", "london2", "london3" ] },
          "nc_servers" :
            { "quorum" : 1, "servers" : [ "nc1", "nc2" ] }
        }
    }
}'
-----
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5856868.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-07 05:38:25 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Amit wrote:
> What if we write the above as:
>
> remotes-1 (london_servers-2 [london1, london2, london3], nc_servers-1
> [nc1, nc2])
Yes, this we can consider.
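Amit's compact notation looks parseable with a few lines of recursive descent. Here is a loose Python sketch; the dict shape and the parentheses-mean-quorum / brackets-mean-priority reading are my assumptions, consistent with the bracket convention proposed elsewhere in this thread.

```python
import re

def parse(s):
    """Parse e.g. 'remotes-1 (london_servers-2 [l1, l2], nc-1 [n1, n2])'."""
    tokens = re.findall(r'[A-Za-z0-9_]+|[()\[\],-]', s)
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def take():
        t = tokens[pos[0]]; pos[0] += 1; return t
    def element():
        name = take()
        if peek() != '-':
            return name                        # leaf: a plain standby name
        take()                                 # consume '-'
        count = int(take())
        is_priority = take() == '['            # '[' priority, '(' quorum
        members = [element()]
        while peek() == ',':
            take()
            members.append(element())
        take()                                 # closing ')' or ']'
        return {'name': name, 'count': count,
                'priority': is_priority, 'members': members}
    return element()
```

Nesting falls out naturally, since any member may itself carry a `-count (...)` suffix.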
Thanks,
-----
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5856869.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-07 06:03:50 |
Message-ID: | CAB7nPqRKXWcK_V4bEu3xj1XJnq7Yx9doZa_3R7pKh6ekgEszmA@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jul 7, 2015 at 2:19 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 07/06/2015 09:56 PM, Michael Paquier wrote:
>> On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> On 07/06/2015 06:40 PM, Michael Paquier wrote:
>>>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>>>> pro-JSON:
>>>>>
>>>>> * standard syntax which is recognizable to sysadmins and devops.
>>>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>>>>> additions/deletions from the synch rep config.
>>>>> * can add group labels (see below)
>>>>
>>>> If we go this way, I think that managing a JSON blob with a GUC
>>>> parameter is crazy, this is way longer in character size than a simple
>>>> formula because of the key names. Hence, this JSON blob should be in a
>>>> separate place than postgresql.conf not within the catalog tables,
>>>> manageable using an SQL interface, and reloaded in backends using
>>>> SIGHUP.
>>>
>>> I'm not following this at all. What are you saying here?
>>
>> A JSON string is longer in terms of number of characters than a
>> formula because it contains key names, and those key names are usually
>> repeated several times, making it harder to read in a configuration
>> file. So what I am saying is that we do not save it as a GUC, but as
>> separate metadata that can be accessed with a set of SQL functions
>> to manipulate it.
>
> Where, though? Someone already pointed out the issues with storing it
> in a system catalog, and adding an additional .conf file with a
> different format is too horrible to contemplate.
Something like pg_syncinfo/ coupled with an LWLock; we already do
something similar for replication slots with pg_replslot/.
--
Michael
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-10 13:06:00 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hello,
Tue, Jul 7, 2015 at 02:56 AM, Josh Berkus wrote:
> pro-JSON:
>
> * standard syntax which is recognizable to sysadmins and devops.
> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
> additions/deletions from the synch rep config.
> * can add group labels (see below)
Adding group labels does have a lot of value, but as Amit has pointed out,
with a little modification, they can be included in the GUC as well. It will
not make it any more complex.
On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
> Something like pg_syncinfo/ coupled with a LW lock, we already do
> something similar for replication slots with pg_replslot/.
I was trying to figure out how the JSON metadata can be used.
It would have to be set using a given set of functions. Right?
I am sorry this question is very basic.
The functions could be something like:
1. pg_add_synch_set(set_name NAME, quorum INT, is_priority bool, set_members
VARIADIC)
This will be used to add a sync set. The set_members can be individual
elements or the name of another set. The parameter is_priority is used to
decide whether the set is a priority (true) or quorum (false) set. This
function call will create a folder pg_syncinfo/groups/$NAME and store the
JSON blob?
The root group would be automatically set by finding the group which is not
included in any other group? Or could it be set by another function?
2. pg_modify_sync_set(set_name NAME, quorum INT, is_priority bool,
set_members VARIADIC)
This will update the pg_syncinfo/groups/$NAME to store the new values.
3. pg_drop_synch_set(set_name NAME)
This will remove the pg_syncinfo/groups/$NAME folder. Also, all the groups
which included this one would be updated?
4. pg_show_synch_set()
This will display the current sync settings in JSON format.
Am I missing something?
Is JSON being preferred because it would be ALTER SYSTEM friendly and in a
format already known to users?
In a real-life scenario, at most how many groups and nesting would be
expected?
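To make the pg_syncinfo/ idea concrete: the function set sketched above would essentially manage one small JSON file per group under the data directory. A purely hypothetical sketch of that storage layer in Python follows; the pg_syncinfo/groups/$NAME layout comes from this thread, and none of this exists in PostgreSQL.

```python
import json
import os
import tempfile

def write_sync_group(datadir, name, definition):
    """Store a group definition as pg_syncinfo/groups/<name> (hypothetical)."""
    path = os.path.join(datadir, "pg_syncinfo", "groups")
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, name), "w") as f:
        json.dump(definition, f)

def read_sync_group(datadir, name):
    with open(os.path.join(datadir, "pg_syncinfo", "groups", name)) as f:
        return json.load(f)

def drop_sync_group(datadir, name):
    os.remove(os.path.join(datadir, "pg_syncinfo", "groups", name))

datadir = tempfile.mkdtemp()
write_sync_group(datadir, "nc_servers", {"quorum": 1, "servers": ["nc1", "nc2"]})
print(read_sync_group(datadir, "nc_servers"))
drop_sync_group(datadir, "nc_servers")
```

In the real server, reads and writes would of course have to be serialized by the LWLock mentioned upthread.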
-----
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5857516.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-13 12:22:55 |
Message-ID: | CAD21AoA1bOzUa+9sdq1rcfqsC8LUdCPDbYo+_E=4fvRAd_NGmg@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Hello,
>
> Tue, Jul 7, 2015 at 02:56 AM, Josh Berkus wrote:
>> pro-JSON:
>>
>> * standard syntax which is recognizable to sysadmins and devops.
>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>> additions/deletions from the synch rep config.
>> * can add group labels (see below)
>
> Adding group labels does have a lot of value, but as Amit has pointed out,
> with a little modification, they can be included in the GUC as well. It
> will not make it any more complex.
>
> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
>
>> Something like pg_syncinfo/ coupled with a LW lock, we already do
>> something similar for replication slots with pg_replslot/.
>
> I was trying to figure out how the JSON metadata can be used.
> It would have to be set using a given set of functions. Right?
> I am sorry this question is very basic.
>
> The functions could be something like:
> 1. pg_add_synch_set(set_name NAME, quorum INT, is_priority bool, set_members
> VARIADIC)
>
> This will be used to add a sync set. The set_members can be individual
> elements of another set name. The parameter is_priority is used to decide
> whether the set is priority (true) set or quorum (false). This function call
> will create a folder pg_syncinfo/groups/$NAME and store the json blob?
>
> The root group would be automatically set by finding the group which is not
> included in any other group? Or could it be set by another function?
>
> 2. pg_modify_sync_set(set_name NAME, quorum INT, is_priority bool,
> set_members VARIADIC)
>
> This will update the pg_syncinfo/groups/$NAME to store the new values.
>
> 3. pg_drop_synch_set(set_name NAME)
>
> This will update the pg_syncinfo/groups/$NAME folder. Also all the groups
> which included this would be updated?
>
> 4. pg_show_synch_set()
>
> this will display the current sync setting in json format.
>
> Am I missing something?
>
> Is JSON being preferred because it would be ALTER SYSTEM friendly and in a
> format already known to users?
>
> In a real-life scenario, at most how many groups and nesting would be
> expected?
>
I might be missing something, but will these functions generate WAL?
If they do, we will face the situation where we need to wait
forever, as Fujii-san pointed out.
Regards,
--
Masahiko Sawada
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-13 12:26:32 |
Message-ID: | CAB7nPqQpRFnikqGDMbtxeEsj6sO40TYHh8D9zESQ+GA8KnEKXQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jul 13, 2015 at 9:22 PM, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> wrote:
> I might be missing something, but will these functions generate WAL?
> If they do, we will face the situation where we need to wait
> forever, as Fujii-san pointed out.
No, those functions are here to manipulate the metadata defining the
quorum/priority set. We definitely do not want something that
generates WAL.
--
Michael
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-13 13:34:56 |
Message-ID: | CAHGQGwH5SHdS+2Qa1nM9-+ME5frb56eqn-2c=ja_UO=zUmJcEA@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Hello,
>
> Tue, Jul 7, 2015 at 02:56 AM, Josh Berkus wrote:
>> pro-JSON:
>>
>> * standard syntax which is recognizable to sysadmins and devops.
>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>> additions/deletions from the synch rep config.
>> * can add group labels (see below)
>
> Adding group labels does have a lot of value, but as Amit has pointed out,
> with a little modification, they can be included in the GUC as well.
Or you can extend the custom GUC mechanism so that we can
specify the groups by using them, for example,
quorum_commit.mygroup1 = 'london, nyc'
quorum_commit.mygroup2 = 'tokyo, pune'
synchronous_standby_names = '1(mygroup1), 1(mygroup2)'
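Group labels defined through custom GUCs like this could be expanded into the existing s_s_names syntax by plain substitution. A rough Python sketch of that expansion step; the quorum_commit.* names come from Fujii-san's example above, while the expansion mechanism itself is my assumption.

```python
import re

# Hypothetical custom GUCs, mirroring the example above.
groups = {
    "mygroup1": "london, nyc",
    "mygroup2": "tokyo, pune",
}

def expand(s_s_names):
    """Replace each N(label) with N(members-of-label)."""
    def repl(m):
        count, label = m.group(1), m.group(2)
        return "%s(%s)" % (count, groups.get(label, label))
    return re.sub(r'(\d+)\(([A-Za-z0-9_]+)\)', repl, s_s_names)

print(expand("1(mygroup1), 1(mygroup2)"))
# -> 1(london, nyc), 1(tokyo, pune)
```

An unknown label is left untouched here; a real implementation would presumably reject it at GUC-check time.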
> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
>
>> Something like pg_syncinfo/ coupled with a LW lock, we already do
>> something similar for replication slots with pg_replslot/.
>
> I was trying to figure out how the JSON metadata can be used.
> It would have to be set using a given set of functions.
So we can use only such a set of functions to configure synch rep?
I don't like that idea, because it prevents us from configuring it
while the server is not running.
> Is JSON being preferred because it would be ALTER SYSTEM friendly and in a
> format already known to users?
At least currently, ALTER SYSTEM cannot accept JSON data
(e.g., the return value of a JSON function like json_build_object())
as the setting value. So I'm not sure how friendly ALTER SYSTEM
and the JSON format really are. If you want to argue that, you probably
need to improve ALTER SYSTEM so that JSON can be specified.
> In a real-life scenario, at most how many groups and nesting would be
> expected?
I don't think that many groups and nestings are common.
Regards,
--
Fujii Masao
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-14 00:00:13 |
Message-ID: | CAB7nPqSmy3M5DWu2BLOSYBbQeB9f853RCF+h_JU03k67u4+6Eg@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jul 13, 2015 at 10:34 PM, Fujii Masao wrote:
> On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson wrote:
>> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
>>
>>> Something like pg_syncinfo/ coupled with a LW lock, we already do
>>> something similar for replication slots with pg_replslot/.
>>
>> I was trying to figure out how the JSON metadata can be used.
>> It would have to be set using a given set of functions.
>
> So we can use only such a set of functions to configure synch rep?
> I don't like that idea, because it prevents us from configuring it
> while the server is not running.
If you store a JSON blob in a set of files in PGDATA, you could update
them manually there as well. That's perhaps re-inventing the wheel
with what is available with GUCs, though.
>> Is JSON being preferred because it would be ALTER SYSTEM friendly and in a
>> format already known to users?
>
> At least currently, ALTER SYSTEM cannot accept JSON data
> (e.g., the return value of a JSON function like json_build_object())
> as the setting value. So I'm not sure how friendly ALTER SYSTEM
> and the JSON format really are. If you want to argue that, you probably
> need to improve ALTER SYSTEM so that JSON can be specified.
>
>> In a real-life scenario, at most how many groups and nesting would be
>> expected?
>
> I don't think that many groups and nestings are common.
Yeah, in most common configurations people are not going to have more
than 3 groups with only one level of nodes.
--
Michael
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-14 01:45:53 |
Message-ID: | CAHGQGwHni4KSgiGchhMY1mrTCM_Ez9x2LJQ-hE7vZZ3EvVPypw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jul 14, 2015 at 9:00 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Mon, Jul 13, 2015 at 10:34 PM, Fujii Masao wrote:
>> On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson wrote:
>>> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
>>>
>>>> Something like pg_syncinfo/ coupled with a LW lock, we already do
>>>> something similar for replication slots with pg_replslot/.
>>>
>>> I was trying to figure out how the JSON metadata can be used.
>>> It would have to be set using a given set of functions.
>>
>> So we can use only such a set of functions to configure synch rep?
>> I don't like that idea, because it prevents us from configuring it
>> while the server is not running.
>
> If you store a json blob in a set of files of PGDATA you could update
> them manually there as well. That's perhaps re-inventing the wheel
> with what is available with GUCs though.
Why don't we just use a GUC? If the quorum setting is not so complicated
in real scenarios, a GUC seems enough for that.
Regards,
--
Fujii Masao
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-15 05:11:56 |
Message-ID: | CAOG9ApFR8+tzRVEzkvj6ftVkCFmpXeaBmGk+AfyynC4GxiFC9w@mail.gmail.com |
Lists: | pgsql-hackers |
On Jul 14, 2015 7:15 AM, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> On Tue, Jul 14, 2015 at 9:00 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> > On Mon, Jul 13, 2015 at 10:34 PM, Fujii Masao wrote:
> >> On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson wrote:
> >>> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
> >>>
> >>>> Something like pg_syncinfo/ coupled with a LW lock, we already do
> >>>> something similar for replication slots with pg_replslot/.
> >>>
> >>> I was trying to figure out how the JSON metadata can be used.
> >>> It would have to be set using a given set of functions.
> >>
> >> So we can use only such a set of functions to configure synch rep?
> >> I don't like that idea, because it prevents us from configuring it
> >> while the server is not running.
> >
> > If you store a json blob in a set of files of PGDATA you could update
> > them manually there as well. That's perhaps re-inventing the wheel
> > with what is available with GUCs though.
>
> Why don't we just use GUC? If the quorum setting is not so complicated
> in real scenario, GUC seems enough for that.
I agree GUC would be enough.
We could also name groups in it.
I am thinking of the following format, similar to JSON:
<group_name>: <count> (<list>)
Square brackets are used for priority.
Ex:
s_s_names = 'remotes: 2 (london: 1 [lndn1, lndn2], nyc: 1 [nyc1, nyc2])'
Regards,
Beena Emerson
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-15 06:30:11 |
Message-ID: | CAA4eK1Ln43gq4+y3jA3+bHXNo+F49=A2OQcxHpUXVmb7SZpyxg@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jun 26, 2015 at 11:16 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:
>
> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote:
> > Let's start with a complex, fully described use case then work out how
> > to specify what we want.
>
> Well, one of the simplest cases where quorum commit and this
> feature would be useful is, with 2 data centers:
> - on center 1, master A and standby B
> - on center 2, standby C and standby D
> With the current synchronous_standby_names, what we can do now is
> ensure that one node has acknowledged the commit of the master. For
> example, synchronous_standby_names = 'B,C,D'. But you know that :)
> What this feature would allow us to do is, for example, ensure
> that a node on data center 2 has acknowledged the commit of the
> master, meaning that even if data center 1 is completely lost for
> one reason or another, we have at least one node on center 2 that
> has lost no data at transaction commit.
>
I think the way to address this could be via SQL syntax, as that
will make users' lives easier.
Create Replication Setup Master A
Sync_Priority_Standby B Sync_Group_Any_Standby C,D
Sync_Group_Fixed_Standby 2,E,F,G
where
Sync_Priority_Standby - means the same as the current setting in
synchronous_standby_names
Sync_Group_Any_Standby - means that if any one in the group has
acknowledged the commit, the master can proceed
Sync_Group_Fixed_Standby - means that a fixed number
(the first parameter following this option) of standbys from this
group should commit before the master can proceed.
The above syntax is just to explain the idea, but I think we can invent
better syntax if required. We can define these as options in the syntax,
as we do in some other syntaxes, to avoid creating more keywords.
We need to ensure that all these option values are persisted.
> Now, regarding the way to express that, we need to use a concept of
> node group for each element of synchronous_standby_names. A group
> contains a set of elements, each element being a group or a single
> node. And for each group we need to know three things when a commit
> needs to be acknowledged:
> - Does my group need to acknowledge the commit?
> - If yes, how many elements in my group need to acknowledge it?
> - Does the order of my elements matter?
>
I think with the above kind of syntax we can address all these points,
and even if something is missing, it is easily extendable.
> That's where the micro-language idea makes sense to use.
The micro-language idea is good, but I think that providing such
syntax or SQL functions would make it convenient for users to
specify the replication topology.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-15 06:53:31 |
Message-ID: | CANP8+jLUBVZmvQCt9bQCTNpZ6Z9SjNRXucWNYeSAmEt+c1UaAw@mail.gmail.com |
Lists: | pgsql-hackers |
On 7 July 2015 at 07:03, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
> On Tue, Jul 7, 2015 at 2:19 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> > On 07/06/2015 09:56 PM, Michael Paquier wrote:
> >> On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >>> On 07/06/2015 06:40 PM, Michael Paquier wrote:
> >>>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh(at)agliodbs(dot)com>
> wrote:
> >>>>> pro-JSON:
> >>>>>
> >>>>> * standard syntax which is recognizable to sysadmins and devops.
> >>>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
> >>>>> additions/deletions from the synch rep config.
> >>>>> * can add group labels (see below)
> >>>>
> >>>> If we go this way, I think that managing a JSON blob with a GUC
> >>>> parameter is crazy, this is way longer in character size than a simple
> >>>> formula because of the key names. Hence, this JSON blob should be in a
> >>>> separate place than postgresql.conf not within the catalog tables,
> >>>> manageable using an SQL interface, and reloaded in backends using
> >>>> SIGHUP.
> >>>
> >>> I'm not following this at all. What are you saying here?
> >>
> >> A JSON string is longer in terms of number of characters than a
> >> formula because it contains key names, and those key names are usually
> >> repeated several times, making it harder to read in a configuration
> >> file. So what I am saying is that we do not save it as a GUC, but as
> >> separate metadata that can be accessed with a set of SQL functions
> >> to manipulate it.
> >
> > Where, though? Someone already pointed out the issues with storing it
> > in a system catalog, and adding an additional .conf file with a
> > different format is too horrible to contemplate.
>
> Something like pg_syncinfo/ coupled with a LW lock, we already do
> something similar for replication slots with pg_replslot/.
>
-1 to pg_syncinfo/
pg_replslot has persistent state. We are discussing permanent configuration
data, for which I don't see the need to create an additional parallel
infrastructure just to store a string, given the stated objection that the
string is fairly long. AFAICS it's not even that long.
...
JSON seems the most sensible format for the string. Inventing a new one
doesn't make sense. Most important for me is the ability to
programmatically manipulate/edit the config string, which would be harder
with a new custom format.
...
Group labels are essential.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-15 06:58:59 |
Message-ID: | CANP8+jL8k-ELRYoC6JuGY2Kc=E2EWG6A9nFfAa4DNttO+Vhj+w@mail.gmail.com |
Lists: | pgsql-hackers |
On 2 July 2015 at 19:50, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> So there's two parts to this:
>
> 1. I need to ensure that data is replicated to X places.
>
> 2. I need to *know* which places data was synchronously replicated to
> when the master goes down.
>
> My entire point is that (1) alone is useless unless you also have (2).
> And do note that I'm talking about information on the replica, not on
> the master, since in any failure situation we don't have the old master
> around to check.
>
You might *think* you know, but given we are in this situation because of
an unexpected failure, it seems strange to specifically avoid checking
before you proceed.
Bacon not Aristotle.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-15 07:01:23 |
Message-ID: | CANP8+jJnjPqtiCqyDkHawcmm-j=3SPYKqdpCrchhy83fnwwSkg@mail.gmail.com |
Lists: | pgsql-hackers |
On 29 June 2015 at 18:40, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> I'm in favor of a more robust and sophisticated synch rep. But not if
> nobody not on this mailing list can configure it, and not if even we
> don't know what it will do in an actual failure situation.
That's the key point. Editing the config after a failure is a Failure of
Best Practice in an HA system.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-15 09:03:07 |
Message-ID: | CAB7nPqSAXswpon07-RCNcYhCODeH3=WMWejj4Vmq41rWu-LyJQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jul 15, 2015 at 3:53 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> pg_replslot has persistent state. We are discussing permanent configuration
> data, for which I don't see the need to create an additional parallel
> infrastructure just to store a string, given the stated objection that the
> string is fairly long. AFAICS it's not even that long.
>
> ...
>
> JSON seems the most sensible format for the string. Inventing a new one
> doesn't make sense. Most important for me is the ability to programmatically
> manipulate/edit the config string, which would be harder with a new custom
> format.
>
> ...
>
> Group labels are essential.
OK, so this is leading us to the following points:
- Use a JSON object to define the quorum/priority groups for the sync state.
- Store it as a GUC, and use the check hook to validate its format,
which is what we have now with s_s_names
- Rely on SIGHUP to maintain an in-memory image of the quorum/priority
sync state
- Have the possibility to define group labels in this JSON blob, and
be able to use those labels in a quorum or priority sync definition.
- For backward-compatibility, use for example s_s_names = 'json' to
switch to the new system.
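The check-hook point above would boil down to structural validation of the blob before it is accepted. A loose Python analogue of what that validation might check; the key names and rules are assumptions carried over from this thread, not an actual PostgreSQL check hook.

```python
import json

def check_sync_state(blob):
    """Loosely validate a quorum/priority definition before accepting it."""
    try:
        node = json.loads(blob)
    except ValueError:                         # malformed JSON
        return False
    def ok(n):
        if isinstance(n, str):                 # a standby name
            return True
        if not isinstance(n, dict):
            return False
        if "quorum" not in n or "servers" not in n:
            return False
        servers = n["servers"]
        if isinstance(servers, dict):
            members = list(servers.values())
        elif isinstance(servers, list):
            members = servers
        else:
            return False
        if not isinstance(n["quorum"], int):
            return False
        if not 1 <= n["quorum"] <= len(members):
            return False
        return all(ok(m) for m in members)
    return ok(node)

print(check_sync_state('{"quorum": 1, "servers": ["a", "b"]}'))  # True
print(check_sync_state('{"quorum": 3, "servers": ["a", "b"]}'))  # False
```

A quorum larger than the group is rejected outright, which is the kind of mistake a check hook exists to catch before SIGHUP propagates it.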
Also, as a first step of the implementation, do we actually need a set
of functions to manipulate the JSON blob? I mean, we could perhaps
have them in contrib/, but they do not seem mandatory as long as we
document correctly how to define a label group and a quorum or
priority group, no?
--
Michael
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-15 10:39:04 |
Message-ID: | CANP8+j+MdPWJjGLaStfW0tybJDJh56FR-GSu97046jodXzX6Ng@mail.gmail.com |
Lists: | pgsql-hackers |
On 15 July 2015 at 10:03, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
> OK, so this is leading us to the following points:
> - Use a JSON object to define the quorum/priority groups for the sync
> state.
> - Store it as a GUC, and use the check hook to validate its format,
> which is what we have now with s_s_names
> - Rely on SIGHUP to maintain an in-memory image of the quorum/priority
> sync state
> - Have the possibility to define group labels in this JSON blob, and
> be able to use those labels in a quorum or priority sync definition.
>
+1
> - For backward-compatibility, use for example s_s_names = 'json' to
> switch to the new system.
>
Seems easy enough to check whether it has a leading { and then treat
it as an attempt to use JSON (which may fail); otherwise use the
old syntax.
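That dispatch rule amounts to a one-line check before parsing. A minimal sketch; the returned tags are my own labels, not anything PostgreSQL defines.

```python
import json

def classify_s_s_names(value):
    """Leading '{' means JSON (which may fail to parse); anything else is
    treated as the old comma-separated syntax."""
    v = value.strip()
    if v.startswith('{'):
        return 'json', json.loads(v)          # raises on malformed JSON
    return 'legacy', [n.strip() for n in v.split(',') if n.strip()]

print(classify_s_s_names('node1, node2'))
```

Letting malformed JSON fail loudly, rather than falling back to the legacy parser, avoids silently misreading a typo'd blob as a list of standby names.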
> Also, as a first step of the implementation, do we actually need a set
> of functions to manipulate the JSON blob. I mean, we could perhaps
> have them in contrib/ but they do not seem mandatory as long as we
> document correctly how to document a label group and define a quorum
> or priority group, no?
>
Agreed, no specific functions needed to manipulate this field.
If we lack the means to manipulate JSON in SQL, that can be solved outside
the scope of this patch, because it's just JSON.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-15 11:25:04 |
Message-ID: | [email protected] |
Simon Riggs wrote:
> JSON seems the most sensible format for the string. Inventing a new one
> doesn't make sense. Most important for me is the ability to
> programmatically manipulate/edit the config string, which would be harder
> with a new custom format.
Do we need to keep the value consistent across all the servers in the
flock? If not, is the behavior halfway sane upon failover?
If we need the DBA to keep the value in sync manually, that's going to
be a recipe for trouble, which is going to bite particularly hard
during those stressful moments when disaster strikes and things have to
be done in emergency mode.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-15 11:39:40 |
Message-ID: | CANP8+j+1op2Vi_GRe4LY6toz+qMwBT6TaY3PgARvUnowg72joA@mail.gmail.com |
On 15 July 2015 at 12:25, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Simon Riggs wrote:
>
> > JSON seems the most sensible format for the string. Inventing a new one
> > doesn't make sense. Most important for me is the ability to
> > programmatically manipulate/edit the config string, which would be harder
> > with a new custom format.
>
> Do we need to keep the value consistent across all the servers in the
> flock? If not, is the behavior halfway sane upon failover?
>
Mostly, yes. Which means it doesn't change much, so config data is OK.
> If we need the DBA to keep the value in sync manually, that's going to
> be a recipe for trouble. Which is going to bite particularly hard
> during those stressing moments when disaster strikes and things have to
> be done in emergency mode.
>
Manual config itself is the recipe for trouble, not this particular
setting. There are already many other settings that need to be the same on
all nodes for example. Nothing here changes that. This is just an
enhancement of the current technology.
For the future, a richer mechanism for defining nodes and their associated
metadata is needed for logical replication and clustering. That is not what
is being discussed here though, nor should we begin!
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-16 17:27:53 |
Message-ID: | CA+TgmoZ0F=takLNEMXeKcdfx-H4Q27U=tBti65R-cMDfqg5fJA@mail.gmail.com |
On Wed, Jul 15, 2015 at 5:03 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
>> Group labels are essential.
>
> OK, so this is leading us to the following points:
> - Use a JSON object to define the quorum/priority groups for the sync state.
> - Store it as a GUC, and use the check hook to validate its format,
> which is what we have now with s_s_names
> - Rely on SIGHUP to maintain an in-memory image of the quorum/priority
> sync state
> - Have the possibility to define group labels in this JSON blob, and
> be able to use those labels in a quorum or priority sync definition.
> - For backward-compatibility, use for example s_s_names = 'json' to
> switch to the new system.
Personally, I think we're going to find that using JSON for this
rather than a custom syntax makes the configuration strings two or
three times as long for no discernible benefit.
But I just work here.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-16 17:32:46 |
Message-ID: | CANP8+jJw4vt+wfs-OEb-ZRbQ2FoWVw-aSAfBUyQdQH-71k3ywQ@mail.gmail.com |
On 16 July 2015 at 18:27, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jul 15, 2015 at 5:03 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> >> Group labels are essential.
> >
> > OK, so this is leading us to the following points:
> > - Use a JSON object to define the quorum/priority groups for the sync
> state.
> > - Store it as a GUC, and use the check hook to validate its format,
> > which is what we have now with s_s_names
> > - Rely on SIGHUP to maintain an in-memory image of the quorum/priority
> > sync state
> > - Have the possibility to define group labels in this JSON blob, and
> > be able to use those labels in a quorum or priority sync definition.
> > - For backward-compatibility, use for example s_s_names = 'json' to
> > switch to the new system.
>
> Personally, I think we're going to find that using JSON for this
> rather than a custom syntax makes the configuration strings two or
> three times as long for
They may well be 2-3 times as long. Why is that a negative?
> no discernible benefit.
>
Benefits:
* More readable
* Easy to validate
* No additional code required in the server to support this syntax (so no
bugs)
* Developers will immediately understand the format
* Easy to programmatically manipulate in a range of languages
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-16 17:40:42 |
Message-ID: | CA+TgmoYNHKv6qQiQdqdQEFnPoRZedOeOCSdcz05e7oScav+iCA@mail.gmail.com |
On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Personally, I think we're going to find that using JSON for this
>> rather than a custom syntax makes the configuration strings two or
>> three times as long for
>
> They may well be 2-3 times as long. Why is that a negative?
In my opinion, brevity makes things easier to read and understand. We
also don't support multi-line GUCs, so if your configuration takes 140
characters, you're going to have a very long line in your
postgresql.conf (and in your pg_settings output, etc.)
> * No additional code required in the server to support this syntax (so no
> bugs)
I think you'll find that this is far from true. Presumably not any
arbitrary JSON object will be acceptable. You'll have to parse it as
JSON, and then validate that it is of the expected form. It may not
be MORE code than implementing a mini-language from scratch, but I
wouldn't expect to save much.
> * Developers will immediately understand the format
I doubt it. I think any format that we pick will have to be carefully
documented. People may know what JSON looks like in general, but they
will not immediately know what bells and whistles are available in
this context.
> * Easy to programmatically manipulate in a range of languages
I agree that JSON has that advantage, but I doubt that it is important
here. I would expect that people might need to generate a new config
string and dump it into postgresql.conf, but that should be easy with
any reasonable format. I think it will be rare to need to parse the
postgresql.conf string, manipulate it programmatically, and then put it
back. As we've already said, most configurations are simple and
shouldn't change frequently. If they're not or they do, that's a
problem in itself.
However, I'm not trying to ram my idea through; I'm just telling you my opinion.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-17 03:20:14 |
Message-ID: | CAA4eK1JW91Bqo9XfHTtApka5DAJuWEj6xon0dox6zncsNbofjA@mail.gmail.com |
On Thu, Jul 16, 2015 at 11:10 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs <simon(at)2ndquadrant(dot)com>
wrote:
>
> > * Developers will immediately understand the format
>
> I doubt it. I think any format that we pick will have to be carefully
> documented. People may know what JSON looks like in general, but they
> will not immediately know what bells and whistles are available in
> this context.
>
I also think any format where the user has to carefully remember how to
provide the values is not user-friendly. Why is an SQL-based syntax not
preferable in this case? With that we could even achieve consistency of this
parameter across all servers, which I think is not of utmost importance
for this feature, but would still make users happy.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-17 07:14:34 |
Message-ID: | [email protected] |
Robert Haas wrote:
>
> On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs <simon@> wrote:
> >> Personally, I think we're going to find that using JSON for this
> >> rather than a custom syntax makes the configuration strings two or
> >> three times as long for
> >
> > They may well be 2-3 times as long. Why is that a negative?
>
> In my opinion, brevity makes things easier to read and understand. We
> also don't support multi-line GUCs, so if your configuration takes 140
> characters, you're going to have a very long line in your
> postgresql.conf (and in your pg_settings output, etc.)
>
> > * No additional code required in the server to support this syntax (so
> no
> > bugs)
>
> I think you'll find that this is far from true. Presumably not any
> arbitrary JSON object will be acceptable. You'll have to parse it as
> JSON, and then validate that it is of the expected form. It may not
> be MORE code than implementing a mini-language from scratch, but I
> wouldn't expect to save much.
>
> > * Developers will immediately understand the format
>
> I doubt it. I think any format that we pick will have to be carefully
> documented. People may know what JSON looks like in general, but they
> will not immediately know what bells and whistles are available in
> this context.
>
> * Easy to programmatically manipulate in a range of languages
>
> I agree that JSON has that advantage, but I doubt that it is important
> here. I would expect that people might need to generate a new config
> string and dump it into postgresql.conf, but that should be easy with
> any reasonable format. I think it will be rare to need to parse the
> postgresql.conf string, manipulate it programmatically, and then put it
> back. As we've already said, most configurations are simple and
> shouldn't change frequently. If they're not or they do, that's a
> problem in itself.
>
All points here are valid and I would prefer a new language over JSON. I
agree the new validation code would have to be properly tested to avoid
bugs, but it won't be too difficult.
Also, I think methods that generate a WAL record are avoided because any
attempt to change the syncrep settings would wait indefinitely when a
mandatory sync candidate (as per the current settings) goes down (explained
in the earlier post id: CAHGQGwE_-HCzw687B4SdMWqAkkPcu-uxmF3MKyDB9mu38cJ7Jg(at)mail(dot)gmail(dot)com).
-----
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5858255.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-17 23:36:55 |
Message-ID: | [email protected] |
On 7/16/15 12:40 PM, Robert Haas wrote:
>> >They may well be 2-3 times as long. Why is that a negative?
> In my opinion, brevity makes things easier to read and understand. We
> also don't support multi-line GUCs, so if your configuration takes 140
> characters, you're going to have a very long line in your
> postgresql.conf (and in your pg_settings output, etc.)
Brevity goes both ways, but I don't think that's the real problem here;
it's the lack of multi-line support. The JSON that's been proposed makes
you work really hard to track what level of nesting you're at, while
every alternative format I've seen is terse enough to be very clear on a
single line.
I'm guessing it'd be really ugly/hard to support at least this GUC being
multi-line?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
From: | Josh Berkus <josh(at)agliodbs(dot)com> |
---|---|
To: | Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-19 19:46:00 |
Message-ID: | [email protected] |
On 07/17/2015 04:36 PM, Jim Nasby wrote:
> On 7/16/15 12:40 PM, Robert Haas wrote:
>>> >They may well be 2-3 times as long. Why is that a negative?
>> In my opinion, brevity makes things easier to read and understand. We
>> also don't support multi-line GUCs, so if your configuration takes 140
>> characters, you're going to have a very long line in your
>> postgresql.conf (and in your pg_settings output, etc.)
>
> Brevity goes both ways, but I don't think that's the real problem here;
> it's the lack of multi-line support. The JSON that's been proposed makes
> you work really hard to track what level of nesting you're at, while
> every alternative format I've seen is terse enough to be very clear on a
> single line.
I will point out that the proposed non-JSON syntax does not offer any
ability to name consensus/priority groups. I believe that being able to
name groups is vital to managing any complex synch rep, but if we add
names it will make the non-JSON syntax less compact.
>
> I'm guessing it'd be really ugly/hard to support at least this GUC being
> multi-line?
Yes.
Mind you, multi-line GUCs would be useful otherwise, but we don't want
to hinge this feature on making that work.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-19 20:16:20 |
Message-ID: | [email protected] |
Josh Berkus <josh(at)agliodbs(dot)com> writes:
> On 07/17/2015 04:36 PM, Jim Nasby wrote:
>> I'm guessing it'd be really ugly/hard to support at least this GUC being
>> multi-line?
> Mind you, multi-line GUCs would be useful otherwise, but we don't want
> to hinge this feature on making that work.
I'm pretty sure that changing the GUC parser to allow quoted strings to
continue across lines would be trivial. The problem with it is not that
it's hard, it's that omitting a closing quote mark would then result in
the entire file being syntactically broken, with the error message(s)
almost certainly pointing somewhere else than where the actual mistake is.
Do we really want such a global reduction in friendliness to make this
feature easier?
regards, tom lane
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-19 21:21:12 |
Message-ID: | CANP8+j+wANkAYPuXAHiA0922WBL-V=1q=sn_HU_sWUMhhKeGAg@mail.gmail.com |
On 19 July 2015 at 21:16, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > On 07/17/2015 04:36 PM, Jim Nasby wrote:
> >> I'm guessing it'd be really ugly/hard to support at least this GUC being
> >> multi-line?
>
> > Mind you, multi-line GUCs would be useful otherwise, but we don't want
> > to hinge this feature on making that work.
>
> I'm pretty sure that changing the GUC parser to allow quoted strings to
> continue across lines would be trivial.
Agreed
> The problem with it is not that
> it's hard, it's that omitting a closing quote mark would then result in
> the entire file being syntactically broken, with the error message(s)
> almost certainly pointing somewhere else than where the actual mistake is.
>
That depends upon how we specify line-continuation. If we do it with
starting and ending quotes, then we would have the problem you suggest. If
we required each new continuation line to start with a \ then it wouldn't
(or similar). Or perhaps it gets its own file even, an idea raised before.
> Do we really want such a global reduction in friendliness to make this
> feature easier?
>
Clearly not, but we must first decide whether that is how we characterise
the decision.
synchronous_standby_names= is already 25 characters, so that leaves 115
characters - are they always single byte chars?
It's not black and white for me that JSON necessarily requires >115 chars
whereas other formats never will.
What we are discussing is expanding an existing parameter to include more
information. If Josh gets some of the things he's been asking for, then the
format will bloat further. It doesn't take much for me to believe it might
expand further still, so my view from the discussion is that we'll likely
need to expand beyond 115 chars one day whatever format we choose.
I'm personally ambivalent what the exact format is that we choose; I care
much more about the feature than the syntax, always. My contribution so far
was to summarise what I thought was the majority opinion, and to challenge
the thought that JSON had no discernible benefit. If the majority view is
different, I have no problem there.
Clusters of 20 or more standby nodes are reasonably common, so those limits
do seem a little small. Synchronous commit behavior is far from being the
only cluster metadata we need to record. I'm thinking now that this
illustrates that this is the wrong way altogether and we should just be
storing cluster metadata in database tables, which is what was discussed
and agreed at the BDR meeting at PGCon.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-20 07:18:00 |
Message-ID: | [email protected] |
Simon Riggs wrote:
>synchronous_standby_names= is already 25 characters, so that leaves 115
characters - are they always single byte chars?
I am sorry, I did not get why there is a 140 byte limit. Can you please
explain?
-----
Beena Emerson
From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-20 08:05:19 |
Message-ID: | CANP8+j+xj4FaUSoYdS4SkShmdJw0meF28fw4Pf75xttRp1Y6qQ@mail.gmail.com |
On 20 July 2015 at 08:18, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Simon Riggs wrote:
>
> >synchronous_standby_names= is already 25 characters, so that leaves 115
> characters - are they always single byte chars?
>
> I am sorry, I did not get why there is a 140 byte limit. Can you please
> explain?
>
Hmm, sorry, I thought Robert had said there was a 140 byte limit. I misread.
I don't think that affects my point. The choice between formats is not
solely predicated on whether we have multi-line support.
I still think writing down some actual use cases would help bring the
discussion to a conclusion. Inventing a general facility is hard without
some clear goals about what we need to support.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-20 12:59:36 |
Message-ID: | [email protected] |
Simon Riggs wrote:
> The choice between formats is not
> solely predicated on whether we have multi-line support.
> I still think writing down some actual use cases would help bring the
> discussion to a conclusion. Inventing a general facility is hard without
> some clear goals about what we need to support.
We need to at least support the following:
- Grouping: Specify a set of standbys along with the minimum number of commits
required from the group.
- Group Type: Groups can either be priority or quorum groups.
- Group names: to simplify status reporting
- Nesting: At least 2 levels of nesting
Using JSON, sync rep parameter to replicate in 2 different clusters could be
written as:
{"remotes":
{"quorum": 2,
"servers": [{"london":
{"priority": 2,
"servers": ["lndn1", "lndn2", "lndn3"]
}}
,
{"nyc":
{"priority": 1,
"servers": ["ny1", "ny2"]
}}
]
}
}
The same parameter in the new language (as suggested above) could be written
as:
'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
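For illustration, a grammar like this is small enough to handle with a short recursive-descent sketch (assuming, as in the JSON version above, that N(...) denotes a quorum group and N[...] a priority group, with an optional label: prefix; this is only a hypothetical prototype, not the proposed parser code):

```python
import re

# Split the expression into bracket/punctuation tokens and names/counts.
_tok = re.compile(r"[\[\]():,]|[^\s\[\]():,]+")

def _parse_group(toks):
    """Parse one element: either a bare standby name, or
    'label: N(members)' / 'label: N[members]' with the label optional."""
    label, t = None, toks.pop(0)
    if toks and toks[0] == ":":
        toks.pop(0)                  # consume ':'
        label, t = t, toks.pop(0)    # t is now the count
    if toks and toks[0] in "([":
        kind = "quorum" if toks.pop(0) == "(" else "priority"
        close = ")" if kind == "quorum" else "]"
        members = [_parse_group(toks)]
        while toks[0] == ",":
            toks.pop(0)
            members.append(_parse_group(toks))
        assert toks.pop(0) == close, "mismatched bracket"
        return {"label": label, kind: int(t), "members": members}
    return t                         # a plain standby name

def parse_sync_expr(s):
    return _parse_group(_tok.findall(s))
```

For example, parse_sync_expr("remotes: 2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])") yields a nested dict mirroring the JSON form above.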
Also, I was thinking the name of the main group could be optional.
Internally, it can be given the name 'default group' or 'main group' for
status reporting.
The above could also be written as:
'2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
backward compatible:
In JSON, while validating we may have to check whether the value starts
with '{': if so, go for JSON parsing; else proceed with the current method.
A,B,C => 1[A,B,C]. This conversion can be added in the new parser code.
Thoughts?
-----
Beena Emerson
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-21 06:50:59 |
Message-ID: | CAB7nPqSmcVa4u=gdM=V6G1Nciqcr1HfAVKMmX_inOLr5E4suQA@mail.gmail.com |
On Mon, Jul 20, 2015 at 9:59 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Simon Riggs wrote:
>
>> The choice between formats is not
>> solely predicated on whether we have multi-line support.
>
>> I still think writing down some actual use cases would help bring the
>> discussion to a conclusion. Inventing a general facility is hard without
>> some clear goals about what we need to support.
>
> We need to at least support the following:
> - Grouping: Specify a set of standbys along with the minimum number of commits
> required from the group.
> - Group Type: Groups can either be priority or quorum group.
As far as I understand, at the lowest level a group is just an alias
for a list of nodes; quorum or priority are properties that can be
applied to a group of nodes when that group is used in the expression
that defines what synchronous commit means.
> - Group names: to simplify status reporting
> - Nesting: At least 2 levels of nesting
If I am following correctly, at the first level there is the
definition of the top level objects, like groups and sync expression.
> Using JSON, sync rep parameter to replicate in 2 different clusters could be
> written as:
>
> {"remotes":
> {"quorum": 2,
> "servers": [{"london":
> {"priority": 2,
> "servers": ["lndn1", "lndn2", "lndn3"]
> }}
> ,
> {"nyc":
> {"priority": 1,
> "servers": ["ny1", "ny2"]
> }}
> ]
> }
> }
> The same parameter in the new language (as suggested above) could be written
> as:
> 'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
OK, there is a typo. That's actually 2(london: 2[lndn1, lndn2, lndn3],
nyc: 1[ny1, ny2]) in your grammar. Honestly, if we want group aliases,
I think that JSON makes the most sense. One of the advantages of a
group is that you can use it in several places in the blob and set
different properties on it, hence we should be able to define a
group outside of the sync expression.
Hence I would think that something like that makes more sense:
{
"sync_standby_names":
{
"quorum":2,
"nodes":
[
{"priority":1,"group":"cluster1"},
{"quorum":2,"nodes":["node1","node2","node3"]}
]
},
"groups":
{
"cluster1":["node11","node12","node13"],
"cluster2":["node21","node22","node23"]
}
}
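To make the intended semantics concrete, here is a rough evaluation sketch. The field names follow the blob above, group references are expanded via "groups", quorum-N requires N satisfied members, and the priority rule is simplified here to "the first N listed members must have acked" (a stand-in, not the real priority semantics):

```python
import json

def _satisfied(node, acked, groups):
    """Recursively decide whether a sync-expression node is satisfied
    by the set of standbys that have acknowledged the commit."""
    if isinstance(node, str):
        return node in acked
    # A node either names a predefined group or lists its members inline.
    members = groups[node["group"]] if "group" in node else node["nodes"]
    oks = [_satisfied(m, acked, groups) for m in members]
    if "quorum" in node:
        return sum(oks) >= node["quorum"]
    # Simplified priority rule: the first N listed members must all
    # have acknowledged.
    return all(oks[: node["priority"]])

def commit_ok(blob, acked):
    cfg = json.loads(blob)
    return _satisfied(cfg["sync_standby_names"], set(acked), cfg["groups"])
```

With the blob above, a commit acknowledged by node11, node1, and node2 would satisfy the top-level quorum of 2, while one acknowledged only by node12 and node1 would not.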
> Also, I was thinking the name of the main group could be optional.
> Internally, it can be given the name 'default group' or 'main group' for
> status reporting.
>
> The above could also be written as:
> '2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>
> backward compatible:
> In JSON, while validating we may have to check if it starts with '{' to go
Something worth noticing, application_name can begin with "{".
> for JSON parsing else proceed with the current method.
> A,B,C => 1[A,B,C]. This can be added in the new parser code.
This makes sense. We could do the same for the JSON-based format by
reusing the in-memory structure used to deparse the blob when the
former grammar is used.
--
Michael
From: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-29 12:03:39 |
Message-ID: | CAD21AoCSajf_p2GV4veg5k5cBr_=zxJzCLU=zZGrZHWgF2TX6g@mail.gmail.com |
On Tue, Jul 21, 2015 at 3:50 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Mon, Jul 20, 2015 at 9:59 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>> Simon Riggs wrote:
>>
>>> The choice between formats is not
>>> solely predicated on whether we have multi-line support.
>>
>>> I still think writing down some actual use cases would help bring the
>>> discussion to a conclusion. Inventing a general facility is hard without
>>> some clear goals about what we need to support.
>>
>> We need to at least support the following:
>> - Grouping: Specify a set of standbys along with the minimum number of commits
>> required from the group.
>> - Group Type: Groups can either be priority or quorum group.
>
> As far as I understood at the lowest level a group is just an alias
> for a list of nodes, quorum or priority are properties that can be
> applied to a group of nodes when this group is used in the expression
> to define what means synchronous commit.
>
>> - Group names: to simplify status reporting
>> - Nesting: At least 2 levels of nesting
>
> If I am following correctly, at the first level there is the
> definition of the top level objects, like groups and sync expression.
>
Grouping seems similar to using the same application_name for different servers.
How does using the same application_name on different servers work?
>> Using JSON, sync rep parameter to replicate in 2 different clusters could be
>> written as:
>>
>> {"remotes":
>> {"quorum": 2,
>> "servers": [{"london":
>> {"priority": 2,
>> "servers": ["lndn1", "lndn2", "lndn3"]
>> }}
>> ,
>> {"nyc":
>> {"priority": 1,
>> "servers": ["ny1", "ny2"]
>> }}
>> ]
>> }
>> }
>> The same parameter in the new language (as suggested above) could be written
>> as:
>> 'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>
> OK, there is a typo. That's actually 2(london: 2[lndn1, lndn2, lndn3],
> nyc: 1[ny1, ny2]) in your grammar. Honestly, if we want group aliases,
> I think that JSON makes the most sense. One of the advantage of a
> group is that you can use it in several places in the blob and set
> different properties into it, hence we should be able to define a
> group out of the sync expression.
> Hence I would think that something like that makes more sense:
> {
> "sync_standby_names":
> {
> "quorum":2,
> "nodes":
> [
> {"priority":1,"group":"cluster1"},
> {"quorum":2,"nodes":["node1","node2","node3"]}
> ]
> },
> "groups":
> {
> "cluster1":["node11","node12","node13"],
> "cluster2":["node21","node22","node23"]
> }
> }
>
>> Also, I was thinking the name of the main group could be optional.
>> Internally, it can be given the name 'default group' or 'main group' for
>> status reporting.
>>
>> The above could also be written as:
>> '2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>>
>> backward compatible:
>> In JSON, while validating we may have to check if it starts with '{' to go
>
> Something worth noticing, application_name can begin with "{".
>
>> for JSON parsing else proceed with the current method.
>
>> A,B,C => 1[A,B,C]. This can be added in the new parser code.
>
> This makes sense. We could do the same for the JSON-based format by
> reusing the in-memory structure used to deparse the blob when the
> former grammar is used.
If I validate the s_s_names JSON syntax, I would definitely use JSONB
rather than JSON, because JSONB already has useful functions for adding
and deleting nodes in s_s_names today. But the downside of using JSONB
for s_s_names is that it can change the key order (and remove duplicate
keys). For example, with the syntax Michael suggested:
* JSON (just casting JSON)
json
------------------------------------------------------------------------
{ +
"sync_standby_names": +
{ +
"quorum":2, +
"nodes": +
[ +
{"priority":1,"group":"cluster1"}, +
{"quorum":2,"nodes":["node1","node2","node3"]}+
] +
}, +
"groups": +
{ +
"cluster1":["node11","node12","node13"], +
"cluster2":["node21","node22","node23"] +
} +
}
* JSONB (using jsonb_pretty)
jsonb_pretty
--------------------------------------
{ +
"groups": { +
"cluster1": [ +
"node11", +
"node12", +
"node13" +
], +
"cluster2": [ +
"node21", +
"node22", +
"node23" +
] +
}, +
"sync_standby_names": { +
"nodes": [ +
{ +
"group": "cluster1",+
"priority": 1 +
}, +
{ +
"nodes": [ +
"node1", +
"node2", +
"node3" +
], +
"quorum": 2 +
} +
], +
"quorum": 2 +
} +
}
"groups" and "sync_standby_names" have switched places. I'm not sure
that is good for users.
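The behavior can be illustrated in Python; here json.dumps(..., sort_keys=True) merely stands in for jsonb's canonical key ordering (jsonb actually orders keys by length, then bytewise), so this is an analogy rather than PostgreSQL's exact behavior:

```python
import json

blob = '{"sync_standby_names": {"quorum": 2}, "groups": {"cluster1": ["node11"]}}'

# Plain json text keeps the key order exactly as written.
as_json = json.loads(blob)
print(list(as_json))                # ['sync_standby_names', 'groups']

# A jsonb-like canonical form reorders keys on output: "groups" now
# comes before "sync_standby_names".
canonical = json.dumps(as_json, sort_keys=True)
print(canonical.index('"groups"') < canonical.index('"sync_standby_names"'))

# Duplicate keys are silently collapsed, keeping the last value --
# another difference users may not expect.
dup = json.loads('{"quorum": 1, "quorum": 2}')
print(dup)                          # {'quorum': 2}
```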
Regards,
--
Masahiko Sawada
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-29 12:28:37 |
Message-ID: | CAB7nPqSU=07zYBU-mEyiQOHpcTmBY=0VbJd9GjGCWzEVB9E0KQ@mail.gmail.com |
On Wed, Jul 29, 2015 at 9:03 PM, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Tue, Jul 21, 2015 at 3:50 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Mon, Jul 20, 2015 at 9:59 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>>> Simon Riggs wrote:
>>>
>>>> The choice between formats is not
>>>> solely predicated on whether we have multi-line support.
>>>
>>>> I still think writing down some actual use cases would help bring the
>>>> discussion to a conclusion. Inventing a general facility is hard without
>>>> some clear goals about what we need to support.
>>>
>>> We need to at least support the following:
>>> - Grouping: Specify of standbys along with the minimum number of commits
>>> required from the group.
>>> - Group Type: Groups can either be priority or quorum group.
>>
>> As far as I understood at the lowest level a group is just an alias
>> for a list of nodes, quorum or priority are properties that can be
>> applied to a group of nodes when this group is used in the expression
>> to define what means synchronous commit.
>>
>>> - Group names: to simplify status reporting
>>> - Nesting: At least 2 levels of nesting
>>
>> If I am following correctly, at the first level there is the
>> definition of the top level objects, like groups and sync expression.
>>
>
> Grouping seems similar to using the same application_name for different servers.
> How does using the same application_name on different servers work?
In the case of a priority group, both such nodes get the same priority.
Imagine for example that we need to wait for 2 nodes with the lowest
priority: node1 with priority 1, node2 with priority 2, and again node2
with priority 2; we would wait for the first one, and then for one of the
second. In a quorum group, any of them could qualify for selection.
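As a minimal sketch of the two wait rules (the function names and data structures are illustrative, not the server's actual code):

```python
def priority_wait_set(standbys, n):
    """Priority rule: wait for the n standbys with the lowest priority values."""
    return sorted(standbys, key=lambda s: s["priority"])[:n]

def quorum_satisfied(acked, n):
    """Quorum rule: satisfied once any n members have acknowledged."""
    return len(acked) >= n

standbys = [
    {"name": "node1", "priority": 1},
    {"name": "node2", "priority": 2},   # two standbys sharing a name share
    {"name": "node2", "priority": 2},   # a priority; either can qualify
]
print([s["name"] for s in priority_wait_set(standbys, 2)])  # node1 plus a node2
print(quorum_satisfied({"node1", "node2"}, 2))              # True
```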
>>> Using JSON, sync rep parameter to replicate in 2 different clusters could be
>>> written as:
>>>
>>> {"remotes":
>>> {"quorum": 2,
>>> "servers": [{"london":
>>> {"priority": 2,
>>> "servers": ["lndn1", "lndn2", "lndn3"]
>>> }}
>>> ,
>>> {"nyc":
>>> {"priority": 1,
>>> "servers": ["ny1", "ny2"]
>>> }}
>>> ]
>>> }
>>> }
>>> The same parameter in the new language (as suggested above) could be written
>>> as:
>>> 'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>>
>> OK, there is a typo. That's actually 2(london: 2[lndn1, lndn2, lndn3],
>> nyc: 1[ny1, ny2]) in your grammar. Honestly, if we want group aliases,
>> I think that JSON makes the most sense. One of the advantage of a
>> group is that you can use it in several places in the blob and set
>> different properties into it, hence we should be able to define a
>> group out of the sync expression.
>> Hence I would think that something like that makes more sense:
>> {
>> "sync_standby_names":
>> {
>> "quorum":2,
>> "nodes":
>> [
>> {"priority":1,"group":"cluster1"},
>> {"quorum":2,"nodes":["node1","node2","node3"]}
>> ]
>> },
>> "groups":
>> {
>> "cluster1":["node11","node12","node13"],
>> "cluster2":["node21","node22","node23"]
>> }
>> }
>>
>>> Also, I was thinking the name of the main group could be optional.
>>> Internally, it can be given the name 'default group' or 'main group' for
>>> status reporting.
>>>
>>> The above could also be written as:
>>> '2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>>>
>>> backward compatible:
>>> In JSON, while validating we may have to check if it starts with '{' to go
>>
>> Something worth noticing, application_name can begin with "{".
>>
>>> for JSON parsing else proceed with the current method.
>>
>>> A,B,C => 1[A,B,C]. This can be added in the new parser code.
>>
>> This makes sense. We could do the same for the JSON-based format by
>> reusing the in-memory structure used to deparse the blob when the
>> former grammar is used.
>
> If I validate the s_s_names JSON syntax, I would definitely use JSONB
> rather than JSON, because JSONB already has useful functions for adding
> and deleting nodes in s_s_names today. But the downside of using JSONB
> for s_s_names is that it can change the key order (and remove duplicate
> keys). For example, with the syntax Michael suggested,
> [...]
> "groups" and "sync_standby_names" have switched places. I'm not sure
> that is good for users.
I think that's perfectly fine.
--
Michael
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-30 05:16:51 |
Message-ID: | [email protected] |
Hello,
Just looking at how the 2 different methods can be used to set the s_s_names
value.
1. For a simple case where quorum is required for a single group the JSON
could be:
{
"sync_standby_names":
{
"quorum":2,
"nodes":
[ "node1","node2","node3" ]
}
}
or
{
"sync_standby_names":
{
"quorum":2,
"group": "cluster1"
},
"groups":
{
"cluster1":["node1","node2","node3"]
}
}
Language:
2(node1, node2, node3)
2. For having quorum between different groups and node:
{
"sync_standby_names":
{
"quorum":2,
"nodes":
[
{"priority":1,"nodes":["node0"]},
{"quorum":2,"group": "cluster1"}
]
},
"groups":
{
"cluster1":["node1","node2","node3"]
}
}
or
{
"sync_standby_names":
{
"quorum":2,
"nodes":
[
{"priority":1,"group": "cluster2"},
{"quorum":2,"group": "cluster1"}
]
},
"groups":
{
"cluster1":["node1","node2","node3"],
"cluster2":["node0"]
}
}
Language:
2 (node0, cluster1: 2(node1, node2, node3))
Since there will not be much nesting and grouping, I still prefer the new
language to JSON.
I understand one can easily modify/add groups in JSON using built-in
functions, but I think changes will not be made too often.
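For what it's worth, a parser for the mini-language need not be large. A minimal recursive-descent sketch in Python, assuming '(' introduces a quorum group and '[' a priority group (the thread has not fixed that mapping), with illustrative names throughout:

```python
import re

TOKEN = re.compile(r"\s*(\w+|[(),:\[\]])")

def tokenize(s):
    toks, i = [], 0
    while i < len(s):
        m = TOKEN.match(s, i)
        if not m:
            raise ValueError(f"bad character at {i}: {s[i]!r}")
        toks.append(m.group(1))
        i = m.end()
    return toks

def parse(text):
    toks = tokenize(text)
    node, pos = parse_group(toks, 0)
    if pos != len(toks):
        raise ValueError("trailing tokens")
    return node

def parse_group(toks, i):
    name = None
    if i + 1 < len(toks) and toks[i + 1] == ":":      # optional group alias
        name, i = toks[i], i + 2
    # count followed by '(' (quorum) or '[' (priority); a purely numeric
    # standby name would need quoting, which this sketch ignores
    if toks[i].isdigit() and i + 1 < len(toks) and toks[i + 1] in "([":
        count = int(toks[i])
        close = ")" if toks[i + 1] == "(" else "]"
        kind = "quorum" if toks[i + 1] == "(" else "priority"
        i += 2
        members = []
        while True:                                    # comma-separated members
            child, i = parse_group(toks, i)
            members.append(child)
            if toks[i] == ",":
                i += 1
            elif toks[i] == close:
                return {"name": name, kind: count, "nodes": members}, i + 1
            else:
                raise ValueError(f"expected ',' or {close!r}")
    return toks[i], i + 1                              # bare standby name

tree = parse("2 (node0, cluster1: 2(node1, node2, node3))")
print(tree["quorum"], tree["nodes"][0], tree["nodes"][1]["name"])
```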
-----
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5860197.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-07-30 20:04:41 |
Message-ID: | CA+TgmoZFM_9OOnJWc83aLNVpwqd0gnOhndTSmRmiF6umirVTNw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Jul 19, 2015 at 4:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> On 07/17/2015 04:36 PM, Jim Nasby wrote:
>>> I'm guessing it'd be really ugly/hard to support at least this GUC being
>>> multi-line?
>
>> Mind you, multi-line GUCs would be useful otherwise, but we don't want
>> to hinge this feature on making that work.
>
> I'm pretty sure that changing the GUC parser to allow quoted strings to
> continue across lines would be trivial. The problem with it is not that
> it's hard, it's that omitting a closing quote mark would then result in
> the entire file being syntactically broken, with the error message(s)
> almost certainly pointing somewhere else than where the actual mistake is.
> Do we really want such a global reduction in friendliness to make this
> feature easier?
Maybe shoehorning this into the GUC mechanism is the wrong thing, and
what we really need is a new config file for this. The information
we're proposing to store seems complex enough to justify that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-08-04 05:57:12 |
Message-ID: | CAD21AoD50zOocwUg2C1EEEwsgsvngd8cysVVXxChDqpUO6ykOQ@mail.gmail.com |
On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Hello,
>
> Just looking at how the 2 different methods can be used to set the s_s_names
> value.
>
> 1. For a simple case where quorum is required for a single group the JSON
> could be:
>
> {
> "sync_standby_names":
> {
> "quorum":2,
> "nodes":
> [ "node1","node2","node3" ]
> }
> }
>
> or
>
> {
> "sync_standby_names":
> {
> "quorum":2,
> "group": "cluster1"
> },
> "groups":
> {
> "cluster1":["node1","node2","node3"]
> }
> }
>
> Language:
> 2(node1, node2, node3)
>
>
> 2. For having quorum between different groups and node:
> {
> "sync_standby_names":
> {
> "quorum":2,
> "nodes":
> [
> {"priority":1,"nodes":["node0"]},
> {"quorum":2,"group": "cluster1"}
> ]
> },
> "groups":
> {
> "cluster1":["node1","node2","node3"]
> }
> }
>
> or
> {
> "sync_standby_names":
> {
> "quorum":2,
> "nodes":
> [
> {"priority":1,"group": "cluster2"},
> {"quorum":2,"group": "cluster1"}
> ]
> },
> "groups":
> {
> "cluster1":["node1","node2","node3"],
> "cluster2":["node0"]
> }
> }
>
> Language:
> 2 (node0, cluster1: 2(node1, node2, node3))
>
> Since there will not be much nesting and grouping, I still prefer the new
> language to JSON.
> I understand one can easily modify/add groups in JSON using built-in
> functions, but I think changes will not be made too often.
>
If we decide to use a dedicated language, a syntax checker for that
language is needed, via SQL or something.
Otherwise we will not be able to know whether that value parses
correctly until reloading or restarting the server.
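A pre-reload syntax check is cheap if the format is JSON. A Python sketch of what such a validator could report (hypothetical, not part of any patch):

```python
import json

def check_syncinfo(text):
    """Report a syntax problem up front instead of failing at SIGHUP time."""
    try:
        json.loads(text)
        return True, "syntax OK"
    except json.JSONDecodeError as e:
        return False, f"line {e.lineno}: {e.msg}"

print(check_syncinfo('{"sync_info": {"quorum": 2}}'))       # (True, 'syntax OK')
ok, msg = check_syncinfo('{"sync_info": {"quorum": 2,}}')   # trailing comma
print(ok, msg)
```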
Regards,
--
Masahiko Sawada
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-08-04 06:05:21 |
Message-ID: | CAB7nPqRxhoz61LLFifJLAox6ixoU0a23i9vaGGzGocT3vJ5KPw@mail.gmail.com |
On Tue, Aug 4, 2015 at 2:57 PM, Masahiko Sawada wrote:
> On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson wrote:
>> Since there will not be much nesting and grouping, I still prefer the new
>> language to JSON.
>> I understand one can easily modify/add groups in JSON using built-in
>> functions, but I think changes will not be made too often.
>>
>
> If we decide to use a dedicated language, a syntax checker for that
> language is needed, via SQL or something.
Well, sure, both approaches have downsides.
> Otherwise we will not be able to know whether that value parses
> correctly until reloading or restarting the server.
And this is the case for any format. String-format validation for a GUC
occurs when the server is reloaded or restarted; one advantage of JSON
is that the parser/validator is already there, so we don't need to
reinvent new machinery for that.
--
Michael
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-08-04 06:27:35 |
Message-ID: | [email protected] |
Michael Paquier wrote:
> And this is the case for any format. String-format validation for a GUC
> occurs when the server is reloaded or restarted; one advantage of JSON
> is that the parser/validator is already there, so we don't need to
> reinvent new machinery for that.
IIUC, we would also have to add code to check that the given JSON has
the required keys and entries. For example, the "group" mentioned in the
"s_s_names" value should be defined in the "groups" section, etc.
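That semantic check is a small tree walk. A Python sketch, assuming the JSON layout Michael proposed ("sync_standby_names" plus a "groups" map); the function names are illustrative:

```python
import json

def referenced_groups(node, found=None):
    """Collect every "group" alias referenced inside the sync expression."""
    found = set() if found is None else found
    if isinstance(node, dict):
        for key, val in node.items():
            if key == "group":
                found.add(val)
            else:
                referenced_groups(val, found)
    elif isinstance(node, list):
        for item in node:
            referenced_groups(item, found)
    return found

def undefined_groups(blob):
    """Group aliases used in "sync_standby_names" but absent from "groups"."""
    cfg = json.loads(blob)
    defined = set(cfg.get("groups", {}))
    return referenced_groups(cfg.get("sync_standby_names", {})) - defined

blob = '''{"sync_standby_names":
             {"quorum": 2,
              "nodes": [{"priority": 1, "group": "cluster1"},
                        {"quorum": 2, "group": "cluster3"}]},
           "groups": {"cluster1": ["node11", "node12"]}}'''
print(undefined_groups(blob))   # {'cluster3'}
```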
-----
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5860758.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-08-04 06:34:24 |
Message-ID: | CAB7nPqRr=_C8Z4JOUaF5LfX3k8jB1nn9tAjyBUgaWvV-6ev-6Q@mail.gmail.com |
On Tue, Aug 4, 2015 at 3:27 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Michael Paquier wrote:
>> And this is the case for any format. String-format validation for a GUC
>> occurs when the server is reloaded or restarted; one advantage of JSON
>> is that the parser/validator is already there, so we don't need to
>> reinvent new machinery for that.
>
> IIUC, we would also have to add code to check that the given JSON has
> the required keys and entries. For example, the "group" mentioned in the
> "s_s_names" value should be defined in the "groups" section, etc.
Yep, true as well.
--
Michael
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-08-04 11:37:51 |
Message-ID: | [email protected] |
Robert Haas wrote:
>Maybe shoehorning this into the GUC mechanism is the wrong thing, and
>what we really need is a new config file for this. The information
>we're proposing to store seems complex enough to justify that.
>
I think the consensus is that JSON is better.
And using a new file with multi line support would be good.
Name of the file: how about pg_syncinfo.conf?
Backward compatibility: synchronous_standby_names will be supported.
synchronous_standby_names='pg_syncinfo' indicates use of new file.
JSON format:
It would contain 2 main keys: "sync_info" and "groups"
The "sync_info" would consist of "quorum"/"priority" with the count and
"nodes"/"group" with the group name or node list.
The optional "groups" key would list out all the "group" mentioned within
"sync_info" along with the node list.
Ex:
1.
{
"sync_info":
{
"quorum":2,
"nodes":
[
"node1", "node2", "node3"
]
}
}
2.
{
"sync_info":
{
"quorum":2,
"nodes":
[
{"priority":1,"group":"cluster1"},
{"quorum":2,"group": "cluster2"},
"node99"
]
},
"groups":
{
"cluster1":["node11","node12"],
"cluster2":["node21","node22","node23"]
}
}
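Resolving the optional "groups" section into concrete node lists would be a simple recursive expansion. A Python sketch over the second example above (illustrative code, not patch code):

```python
import json

def flatten(entry, groups):
    """Expand one sync_info entry into its concrete standby names."""
    if isinstance(entry, str):            # a bare node name such as "node99"
        return [entry]
    if "group" in entry:                  # alias into the "groups" section
        return list(groups[entry["group"]])
    nodes = []
    for child in entry.get("nodes", []):  # nested quorum/priority entries
        nodes.extend(flatten(child, groups))
    return nodes

cfg = json.loads('''{
  "sync_info": {"quorum": 2,
                "nodes": [{"priority": 1, "group": "cluster1"},
                          {"quorum": 2, "group": "cluster2"},
                          "node99"]},
  "groups": {"cluster1": ["node11", "node12"],
             "cluster2": ["node21", "node22", "node23"]}}''')
print(flatten(cfg["sync_info"], cfg["groups"]))
# ['node11', 'node12', 'node21', 'node22', 'node23', 'node99']
```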
Thoughts?
-----
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5860791.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-08-05 02:18:26 |
Message-ID: | CAB7nPqSj-KDYhd7m8JhP1XNmJ7HN96ebsB2eapY=EmV0CVyBag@mail.gmail.com |
On Tue, Aug 4, 2015 at 8:37 PM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Robert Haas wrote:
>>Maybe shoehorning this into the GUC mechanism is the wrong thing, and
>>what we really need is a new config file for this. The information
>>we're proposing to store seems complex enough to justify that.
>>
>
> I think the consensus is that JSON is better.
I guess so as well. Thanks for brainstorming the whole thread in a single post.
> And using a new file with multi line support would be good.
This file just contains a JSON blob, hence we just need to fetch its
content entirely and then let the server parse it using the existing
facilities.
> Name of the file: how about pg_syncinfo.conf?
> Backward compatibility: synchronous_standby_names will be supported.
> synchronous_standby_names='pg_syncinfo' indicates use of new file.
This reinforces that parsing is done at SIGHUP, so that sounds fine to
me. We may still run into an application_name that uses pg_syncinfo, but
that's unlikely to happen...
> JSON format:
> It would contain 2 main keys: "sync_info" and "groups"
> The "sync_info" would consist of "quorum"/"priority" with the count and
> "nodes"/"group" with the group name or node list.
> The optional "groups" key would list out all the "group" mentioned within
> "sync_info" along with the node list.
>
> [...]
>
> Thoughts?
Yes, I think that's the idea. I would let a couple of days to let
people time to give their opinion and objections regarding this
approach though.
--
Michael
From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-08-27 22:06:01 |
Message-ID: | [email protected] |
On Wed, Jul 1, 2015 at 11:21:47AM -0700, Josh Berkus wrote:
> All:
>
> Replying to multiple people below.
>
> On 07/01/2015 07:15 AM, Fujii Masao wrote:
> > On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >> You're confusing two separate things. The primary manageability problem
> >> has nothing to do with altering the parameter. The main problem is: if
> >> there is more than one synch candidate, how do we determine *after the
> >> master dies* which candidate replica was in synch at the time of
> >> failure? Currently there is no way to do that. This proposal plans to,
> >> effectively, add more synch candidate configurations without addressing
> >> that core design failure *at all*. That's why I say that this patch
> >> decreases overall reliability of the system instead of increasing it.
> >
> > I agree this is a problem even today, but it's basically independent from
> > the proposed feature *itself*. So I think that it's better to discuss and
> > work on the problem separately. If so, we might be able to provide
> > good way to find new master even if the proposed feature finally fails
> > to be adopted.
>
> I agree that they're separate features. My argument is that the quorum
> synch feature isn't materially useful if we don't create some feature to
> identify which server(s) were in synch at the time the master died.
I am coming in here late, but I thought the last time we talked about
this that the only reasonable way to communicate that we have changed to
synchronize with a secondary server (different application_name) is to
allow a GUC-configured command string to be run when a change like this
happens. The command string would write a status on another server or
send an email.
Based on the new s_s_name API, this would mean whenever we switch to a
different priority level, like 1 to 2, 2 to 3, or 2 to 1.
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
From: | Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-08-28 01:42:36 |
Message-ID: | [email protected] |
On 8/4/15 9:18 PM, Michael Paquier wrote:
>> >And using a new file with multi line support would be good.
> This file just contains a JSON blob, hence we just need to fetch its
> content entirely and then let the server parse it using the existing
> facilities.
It sounds like there are other places where multi-line GUCs would be
useful, so I think we should just support that instead of creating
something that only works for SR configuration.
I also don't see the problem with supporting multi-line GUCs that are
wrapped in quotes. Yes, you miss a quote and things blow up, but so
what? Anyone that's done any amount of programming has faced that
problem. Heck, if we wanted to be fancy we could watch for the first
line that could have been another GUC and stick that in a hint.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-09-10 18:41:07 |
Message-ID: | CAOG9ApE4idqgX1mxawDiVEYRiSgPQk4o2rMjoqJFepL3rjgiNQ@mail.gmail.com |
Hello,
Please find attached the WIP patch for the proposed feature. It is built
based on the already discussed design.
Changes made:
- add new parameter "sync_file" to provide the location of the pg_syncinfo
file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and
pg_ident file.
- pg_syncinfo file will hold the sync rep information in the approved JSON
format.
- synchronous_standby_names can be set to 'pg_syncinfo.conf' to read the
JSON value stored in the file.
- All the standbys mentioned in the s_s_names or the pg_syncinfo file
currently get the priority as 1 and all others as 0 (async)
- Various functions in syncrep.c to read the json file and store the values
in a struct to be used in checking the quorum status of syncrep standbys
(SyncRepGetQuorumRecPtr function).
It does not support the current behavior for synchronous_standby_names =
'*'. I am yet to thoroughly test the patch.
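The quorum decision itself reduces to an order statistic: with quorum k, the WAL position known to be on at least k standbys is the k-th highest flush LSN among the candidates. A Python sketch of that rule (LSNs as plain integers; SyncRepGetQuorumRecPtr in the patch works over the walsender array, so this is only an analogy):

```python
def quorum_rec_ptr(flush_lsns, k):
    """With quorum k, the k-th highest flush position is already on at
    least k standbys, so commits up to that LSN can be released."""
    if len(flush_lsns) < k:
        return None                      # quorum cannot be met yet
    return sorted(flush_lsns, reverse=True)[k - 1]

# Three standbys at 0x500, 0x300, 0x400 with quorum 2: at least two
# standbys have flushed through 0x400, so that is the safe point.
print(hex(quorum_rec_ptr([0x500, 0x300, 0x400], 2)))   # 0x400
```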
Thoughts?
--
Beena Emerson
Attachment | Content-Type | Size |
---|---|---|
WIP_multiple_syncrep.patch | application/octet-stream | 26.0 KB |
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-09-11 01:15:00 |
Message-ID: | CAB7nPqQ2WAcPAR8OmgVwnx5ZRu4t2mn7LuG63bKBAgpnUDPTdw@mail.gmail.com |
On Fri, Sep 11, 2015 at 3:41 AM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
> Please find attached the WIP patch for the proposed feature. It is built
> based on the already discussed design.
>
> Changes made:
> - add new parameter "sync_file" to provide the location of the pg_syncinfo
> file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and
> pg_ident file.
I am not sure that's really necessary. We could just hardcode its location.
> - pg_syncinfo file will hold the sync rep information in the approved JSON
> format.
OK. Have you considered as well the approach to add support for
multi-line GUC parameters? This has been mentioned a couple of time
above as well, with something like that I imagine:
param = 'value1,' \
'value2,' \
'value3'
and this reads as 'value1,value2,value3'. This would benefit as well
for other parameters.
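The continuation rule above can be sketched in Python: join backslash-continued lines into one logical line, then splice the adjacent quoted pieces (illustrative only; the real GUC lexer would do this in C):

```python
import re

def read_guc_value(raw):
    """Join backslash-continued lines into one logical line, then splice
    adjacent single-quoted pieces into a single value."""
    logical = ""
    for line in raw.splitlines():
        line = line.rstrip()
        if line.endswith("\\"):
            logical += line[:-1].rstrip() + " "
        else:
            logical += line
    name, _, value = logical.partition("=")
    pieces = re.findall(r"'([^']*)'", value)
    return name.strip(), "".join(pieces)

raw = "param = 'value1,' \\\n        'value2,' \\\n        'value3'"
print(read_guc_value(raw))   # ('param', 'value1,value2,value3')
```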
> - synchronous_standby_names can be set to 'pg_syncinfo.conf' to read the
> JSON value stored in the file.
Check.
> - All the standbys mentioned in the s_s_names or the pg_syncinfo file
> currently get the priority as 1 and all others as 0 (async)
> - Various functions in syncrep.c to read the json file and store the values
> in a struct to be used in checking the quorum status of syncrep standbys
> (SyncRepGetQuorumRecPtr function).
> It does not support the current behavior for synchronous_standby_names = '*'.
> I am yet to thoroughly test the patch.
As this patch adds a whole new infrastructure, it is going to need
complex test setups with many configurations, which will require
bash-ing a bunch of new things; we are not protected from bugs in those
scripts, or from manual mistakes during the tests.
What I think looks really necessary with this patch is to have
included a set of tests to prove that the patch actually does what it
should with complex scenarios and that it does it correctly. So we had
better perhaps move on with this patch first:
https://commitfest.postgresql.org/6/197/
And it would be really nice to get the tests of this patch integrated
with it as well. We are not protected from bugs in this patch either,
but having a centralized infrastructure will add a level of confidence
that we are doing things the right way. Your patch also offers a good
occasion to see whether some generic routines would be helpful in this
recovery test suite.
Regards,
--
Michael
From: | Sameer Thakur-2 <Sameer(dot)Thakur(at)nttdata(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-09-12 08:32:37 |
Message-ID: | [email protected] |
Hello,
I applied the patch to HEAD and tried to set up basic async replication,
but I got an error. I turned on logging; details below.
Unpatched Primary Log
LOG: database system was shut down at 2015-09-12 13:41:40 IST
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
Unpatched Standby log
LOG: entering standby mode
LOG: redo starts at 0/2000028
LOG: invalid record length at 0/20000D0
LOG: started streaming WAL from primary at 0/2000000 on timeline 1
LOG: consistent recovery state reached at 0/20000F8
LOG: database system is ready to accept read only connections
Patched Primary log
LOG: database system was shut down at 2015-09-12 13:50:17 IST
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: server process (PID 17317) was terminated by signal 11: Segmentation
fault
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2015-09-12 13:50:18
IST
FATAL: the database system is in recovery mode
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: invalid record length at 0/3000098
LOG: redo is not required
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: server process (PID 17343) was terminated by signal 11: Segmentation
fault
LOG: terminating any other active server processes
Patched Standby log
LOG: database system was interrupted; last known up at 2015-09-12 13:50:16
IST
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
LOG: entering standby mode
LOG: redo starts at 0/2000028
LOG: invalid record length at 0/20000D0
LOG: started streaming WAL from primary at 0/2000000 on timeline 1
FATAL: could not receive data from WAL stream: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
FATAL: could not connect to the primary server: FATAL: the database system
is in recovery mode
Not sure if there is something I am missing which causes this.
regards
Sameer
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5865685.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-09-15 03:19:44 |
Message-ID: | CAEepm=1sM+eHvfehoGzw-Qa_QmVFHnNV7cQyU-BMNZUFmd8ODg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Sep 11, 2015 at 6:41 AM, Beena Emerson <memissemerson(at)gmail(dot)com>
wrote:
> Hello,
>
> Please find attached the WIP patch for the proposed feature. It is built
> based on the already discussed design.
>
> Changes made:
> - add new parameter "sync_file" to provide the location of the pg_syncinfo
> file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and
> pg_ident file.
> - pg_syncinfo file will hold the sync rep information in the approved JSON
> format.
> - synchronous_standby_names can be set to 'pg_syncinfo.conf' to read the
> JSON value stored in the file.
> - All the standbys mentioned in the s_s_names or the pg_syncinfo file
> currently get the priority as 1 and all others as 0 (async)
> - Various functions in syncrep.c to read the json file and store the
> values in a struct to be used in checking the quorum status of syncrep
> standbys (SyncRepGetQuorumRecPtr function).
>
> It does not support the current behavior for synchronous_standby_names =
> '*'. I am yet to thoroughly test the patch.
>
> Thoughts?
>
This is a great feature, thanks for working on it!
Here is some initial feedback after a quick eyeballing of the patch and a
couple of test runs. I will have more soon after I figure out how to
really test it and try out the configuration system...
It crashes when async standbys connect, as already reported by Sameer
Thakur. It doesn't crash with this change:
@@ -700,6 +700,9 @@ SyncRepGetStandbyPriority(void)
if (am_cascading_walsender)
return 0;
+ if (SyncRepStandbyInfo == NULL)
+ return 0;
+
if (CheckNameList(SyncRepStandbyInfo, application_name, false))
return 1;
I got the following error from clang-602.0.53 on my Mac:
walsender.c:1955:11: error: passing 'char volatile[8192]' to parameter of
type 'void *' discards qualifiers
[-Werror,-Wincompatible-pointer-types-discards-qualifiers]
        memcpy(walsnd->name, application_name, strlen(application_name));
        ^~~~~~~~~~~~
I think your memcpy and explicit null termination could be replaced with
strcpy, or maybe something to limit buffer overrun damage in case of sizing
bugs elsewhere. But to get rid of that warning you'd still need to cast
away volatile... I note that you do that in SyncRepGetQuorumRecPtr when
you read the string with strcmp. But is that actually safe, with respect
to load/store reordering around spinlock operations? Do we actually need
volatile-preserving cstring copy and compare functions for this type of
thing?
In walsender_private.h:
+#define MAX_APPLICATION_NAME_LEN 8192
What is the basis for this size? application_name is a GUC with
GUC_IS_NAME set. As far as I can see, it's limited to NAMEDATALEN
(including null terminator), so why not use the exact same buffer size?
In load_syncinfo:
+ len = strlen(standby_name);
+ temp->name = malloc(len);
+ memcpy(temp->name, standby_name, len);
+ temp->name[len] = '\0';
This buffer is one byte too short, and doesn't handle malloc failure. And
generally, this code is equivalent to strdup, and could instead be pstrdup
(which raises an error on allocation failure for free). But I'm not sure
which memory context is appropriate and when this should be freed.
Same problem in sync_info_scalar:
+ state->cur_node->name = (char *) malloc(len);
+ memcpy(state->cur_node->name, token, strlen(token));
+ state->cur_node->name[len] = '\0';
In SyncRepGetQuorumRecPtr, some extra curly braces:
+ if (node->next)
+ {
+     SyncRepGetQuorumRecPtr(node->next, lsnlist, node->priority_group);
+ }
... and:
+ if (*lsnlist == NIL)
+ {
+ *lsnlist = lappend(*lsnlist, lsn);
+ }
In sync_info_object_field_start:
+ ereport(ERROR,
+         (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+          errmsg("Unrecognised key \"%s\" in file \"%s\"",
+                 fname, SYNC_FILENAME)));
I think this should use US spelling (-ized) as you have it elsewhere. Also
the primary error message should not be capitalised according to the "Error
Message Style Guide".
--
Thomas Munro
http://www.enterprisedb.com
From: | Sameer Thakur-2 <Sameer(dot)Thakur(at)nttdata(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-09-15 12:02:13 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello,
Continuing testing:
For the pg_syncinfo.conf below, an error is thrown.
{
"sync_info":
{
"quorum": 3,
"nodes":
[
{"priority":1,"group":"cluster1"},
"A"
]
},
"groups":
{
"cluster1":["B","C"]
}
}
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
TRAP: FailedAssertion("!(n < list->length)", File: "list.c", Line: 392)
LOG: server process (PID 17764) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2015-09-15 17:15:35
IST
In this scenario the quorum specified is 3 but there are only 2 nodes;
what should the expected behaviour be?
I feel the JSON parsing should throw an appropriate error with an
explanation, as the sync rule does not make sense. Having the master keep
waiting for the non-existent 3rd quorum node will not be helpful anyway.
regards
Sameer
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5865954.html
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Sameer Thakur <Sameer(dot)Thakur(at)nttdata(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-09-15 13:25:18 |
Message-ID: | CAOG9ApG97croEU2Dd_3Cze3Cx0oVQzN58Z+cO7nLMxxEU09Few@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello,
Thank you Thomas and Sameer for checking the patch and giving your comments!
I will post an updated patch soon.
Regards,
Beena Emerson
From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Beena Emerson <memissemerson(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-09-17 05:23:41 |
Message-ID: | CAEepm=0e=g2MUk9cUZEMpRp+pOExtGC7PTV+USffjZG8i+O-sw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Sep 15, 2015 at 3:19 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> I got the following error from clang-602.0.53 on my Mac:
>
> walsender.c:1955:11: error: passing 'char volatile[8192]' to parameter of
> type 'void *' discards qualifiers
> [-Werror,-Wincompatible-pointer-types-discards-qualifiers]
> memcpy(walsnd->name, application_name,
> strlen(application_name));
> ^~~~~~~~~~~~
>
> I think your memcpy and explicit null termination could be replaced with
> strcpy, or maybe something to limit buffer overrun damage in case of sizing
> bugs elsewhere. But to get rid of that warning you'd still need to cast
> away volatile... I note that you do that in SyncRepGetQuorumRecPtr when you
> read the string with strcmp. But is that actually safe, with respect to
> load/store reordering around spinlock operations? Do we actually need
> volatile-preserving cstring copy and compare functions for this type of
> thing?
Maybe volatile isn't even needed here at all. I have asked that
question separately here:
In SyncRepGetQuorumRecPtr you have strcmp(node->name, (char *)
walsnd->name): that might be more problematic. I'm not sure about
casting away volatile (it's probably fine at least in practice), but
it's accessing walsnd without the spinlock. The existing
syncrep.c code already did that sort of thing (and I haven't had time
to grok the thinking behind that yet), but I think you may be upping
the ante here by doing non-atomic reads with strcmp (whereas the code
in master always read single word values). Imagine if you hit a slot
that was being set up by InitWalSenderSlot concurrently, and memcpy
was in the process of writing the name. strcmp would read garbage,
maybe even off the end of the buffer because there is no terminator
yet. That may be incredibly unlikely, but it seems fishy. Or I may
have misunderstood the synchronisation at work here completely :-)
--
Thomas Munro
http://www.enterprisedb.com
From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-09-24 08:07:41 |
Message-ID: | CAD21AoC0FdL=PwPf_ioQLG7iBqAtDpU5boeWDzig+-Uj8QUrUQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Sep 11, 2015 at 10:15 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Fri, Sep 11, 2015 at 3:41 AM, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>> Please find attached the WIP patch for the proposed feature. It is built
>> based on the already discussed design.
>>
>> Changes made:
>> - add new parameter "sync_file" to provide the location of the pg_syncinfo
>> file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and
>> pg_ident file.
>
> I am not sure that's really necessary. We could just hardcode its location.
>
>> - pg_syncinfo file will hold the sync rep information in the approved JSON
>> format.
>
> OK. Have you considered as well the approach to add support for
> multi-line GUC parameters? This has been mentioned a couple of time
> above as well, with something like that I imagine:
> param = 'value1,' \
> 'value2,' \
> 'value3'
> and this reads as 'value1,value2,value3'. This would benefit as well
> for other parameters.
>
I agree with adding support for multi-line GUC parameters.
But I thought it is:
param = 'param1,
param2,
param3'
This reads as 'param1,param2,param3'.
Regards,
--
Masahiko Sawada
From: | Beena Emerson <memissemerson(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Support for N synchronous standby servers - take 2 |
Date: | 2015-10-08 14:10:23 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Sawada Masahiko wrote:
>
> I agree with adding support for multi-line GUC parameters.
> But I though it is:
> param = 'param1,
> param2,
> param3'
>
> This reads as 'value1,value2,value3'.
Use of '\' ensures that omitting the closing quote does not break the
entire file.
-----
Beena Emerson
--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5869289.html