Page MenuHomePhabricator

Nested refs expose strip markers
Closed, ResolvedPublic

Description

After 1.18 upgrade, all uses of nested references using #tag:ref expose the strip markers. Example:

{{#tag:ref|note<ref>reference</ref>|group=Note}}
<references group=Note />
<references />

This will show the strip marker starting with UNIQ.


Version: unspecified
Severity: major

Details

Reference
bz31374

Event Timeline

bzimport raised the priority of this task from to High.Nov 21 2014, 11:57 PM
bzimport added a project: Cite.
bzimport set Reference to bz31374.
bzimport added a subscriber: Unknown Object (MLST).

Support for nested references using #tag:ref was added in rev 33066 and has been working with some limitations. This method is documented and well used on the English Wikipedia.

For some real examples, see articles such as

  • [[Anna_Essinger#Footnotes]]
  • [[Bunce_Court_School#Footnotes]]

which were mentioned on Village Pump:
https://secure.wikimedia.org/wikipedia/en/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=454067107#bug_in_.23tag:ref_parser_function

I did a binary search through the SVN history to pinpoint the revision that first exhibited the bug. It appears to have been r82645; the bug does not occur in r82644.

Ok, I think I see what is happening here.

The parser finds a <ref> tag. It calls the hook registered by Cite.php. Cite.php generates wikitext something like this to output to the page in place of a <ref> tag:

<sup id="cite_ref-1" class="reference">[[#cite_note-1|<span><nowiki>[</nowiki></span>1<span><nowiki>]</nowiki></span>]]</sup>

It passes this to the parser's recursiveTagParse. The parser returns something like this:

<sup id="cite_ref-1" class="reference"><a href="#cite_note-1"><span>�UNIQ-1�</span>1<span>�UNIQ-2�</span></a></sup>

Note it replaced the <nowiki> tags with UNIQ markers. Cite.php returns this string from its hook function. The parser stores that as another UNIQ marker (call it �UNIQ-3�), and outputs the UNIQ marker to the page in place of the <ref> tag.

Then, later on, the parser unstrips all the uniq markers for output.

In r82644, does this in a loop: replace all UNIQ markers in the text and repeat until the text stops changing. So in this example, it replaces �UNIQ-3� with text that contains �UNIQ-1� and �UNIQ-2�, and then in the second pass it replaces �UNIQ-1� and �UNIQ-2�, and then in the third pass it makes no changes and stops.

In r82645, it does *not* loop. So in this example, it replaces �UNIQ-3� with text that contains �UNIQ-1� and �UNIQ-2� and stops there, leaving those markers in the page output.

I can think of two possible fixes: have StripState's unstripGeneral(), unstripNowiki(), and unstripBoth() loop as the old version did, or have StripState::addItem() call unstripBoth() on $value before storing it. It also looks like merge() needs to be adjusted to process the saved values, to unstrip them in the latter solution or to rename any existing markers in the former. In the former solution getSubState() also needs to be changed, otherwise it will keep �UNIQ-3� in our example and forget about �UNIQ-1� and �UNIQ-2�.

CC Tim.

Tim, could you look into this? See comment 4 for a detailed analysis of the problem.

(In reply to comment #6)

This bug does not occur on secure.wikimedia.org (compare
https://secure.wikimedia.org/wikipedia/en/wiki/Tropical_Storm_Debra_%281978%29#Footnotes
and https://en.wikipedia.org/wiki/Tropical_Storm_Debra_%281978%29#Footnotes),
even though Special:Version says both are running MW 1.18wmf1 at r98778.

That's probably due to parser cache. Does it happen if you edit the page on secure, change nothing, then click Show preview? If yes, then it's just pcache caching an old, uncorrupted version.

jelle.zijlstra wrote:

It doesn't happen, actually. I also tried to purge and null-edit the page, and I still don't get any UNIQ...QINU on secure.wikimedia.org.

jelle.zijlstra wrote:

It turns out I'm hiding the UNIQ keys with CSS, because they appear in the same place where the square brackets around the key would normally appear. Sorry for the confusion.

Unstrip markers are also exposed when <ref> tags are missed with other extensions, such as those for <gallery>, e.g.

http://en.wikipedia.org/w/index.php?title=Elizabeth_II&oldid=452751535#Arms

This behavior seems to have the same origin as that described by Brad in comment 4.

uplus003f wrote:

I notice that the bug is (for me) only an issue in firefox, whilst safari and opera display the expected behaviour.

Created attachment 9166
Possible patch for the issue

I've attached a possible patch for this issue. It takes the approach of unstripping $value in StripState::addItem().

Attached:

I strongly urge rejecting any patches that make the current hacks more acceptable. From memory:

  • Nested refs are currently not supported.
  • People implemented a hack using {{#tag:ref}} to get around this.
  • As expected, a software upgrade has broken this hack.

The solution here is to add proper support for nesting references (bug 16330). Further fiddling with the parser to maintain bad hacks should be avoided at all costs, in my opinion.

Created attachment 9167
Patch adding parser test case to Cite extension

Patch takes the example from comment 1 and adds it to Cite's parser test cases. (The bug itself is in core, but none of the core tag hooks appear to do recursive parsing so wouldn't trigger it.)

The output is what I get running against 1.17 (looks fine!), and seems to match fine again with Brad's patch installed.

All other parser test cases also seem to run ok with the patch installed, and it looks ok to me.

Would like a quick check with Tim since this is a fix to his black magic. ;)

Attached:

(In reply to comment #14)

I strongly urge rejecting any patches that make the current hacks more
acceptable. From memory:

  • Nested refs are currently not supported.
  • People implemented a hack using {{#tag:ref}} to get around this.
  • As expected, a software upgrade has broken this hack.

The solution here is to add proper support for nesting references (bug 16330).
Further fiddling with the parser to maintain bad hacks should be avoided at all
costs, in my opinion.

In theory you should be able to nest as much as you like using {{#tag}}, eg:

{{#tag:ref|blah blah blah{{#tag:ref|blah2}}}}

and indeed this seems to render fine on 1.17.

The main limitation against nesting tags in the <xmlish> tag form is that our parser since way back is oversimplified: it takes the shortest match and doesn't assume ahead of time that the contents would be meaningful wikitext; kinda like how <textarea> and <script> in HTML aren't allowed to contain </textarea> or </script>, because they have character data content models so a nested <textarea> or <script> opener would seem invisible to the HTML parser.

The curly brace syntax however is quite explicitly ok to nest, as the internals are treated like wikitext and get brace expansion pre-done inside before the hook gets them.

What seems to be the problem with that? Is there internals crud in Cite itself that makes nesting situations particularly perilous? Or does it just seem like a poor practice for Cite/ref in particular because it sounds like an odd usage pattern?

In theory you should be able to nest as much as you like using {{#tag}}

And it does indeed work. I even created template refn because editors were messing up the syntax— group and note have to be at the end of the #tag:ref, after content, which is not intuitive.

However, when using list-defined references, you can only include one nested ref. More throws a "Cite error references no key". See Bug 20707.

This is a real PITA, and has resulted in the proliferation of templates that do not use cite.php, but result in duplicate HTML ids and invalid HTML output.

Yes, rewriting the parser to fix this would be a really good thing. But we should at least get back to the point where we mostly worked.

Should be fixed now (r99062, r99071).

Looks good. Back to mostly working state. I have updated the VPT list.

Thanks!

(In reply to comment #16)

(In reply to comment #14)

I strongly urge rejecting any patches that make the current hacks more
acceptable. From memory:

  • Nested refs are currently not supported.
  • People implemented a hack using {{#tag:ref}} to get around this.
  • As expected, a software upgrade has broken this hack.

The solution here is to add proper support for nesting references (bug 16330).
Further fiddling with the parser to maintain bad hacks should be avoided at all
costs, in my opinion.

In theory you should be able to nest as much as you like using {{#tag}}, eg:

{{#tag:ref|blah blah blah{{#tag:ref|blah2}}}}

and indeed this seems to render fine on 1.17.

The main limitation against nesting tags in the <xmlish> tag form is that our
parser since way back is oversimplified: it takes the shortest match and
doesn't assume ahead of time that the contents would be meaningful wikitext;
kinda like how <textarea> and <script> in HTML aren't allowed to contain
</textarea> or </script>, because they have character data content models so a
nested <textarea> or <script> opener would seem invisible to the HTML parser.

The curly brace syntax however is quite explicitly ok to nest, as the internals
are treated like wikitext and get brace expansion pre-done inside before the
hook gets them.

What seems to be the problem with that? Is there internals crud in Cite itself
that makes nesting situations particularly perilous? Or does it just seem like
a poor practice for Cite/ref in particular because it sounds like an odd usage
pattern?

Putting the {{{{{{{{{}}}}}}}}} madness issues aside, the use-case isn't {{#tag:ref|{{#tag:ref}}}}, it's {{#tag:ref|<ref></ref>}} (see comment 0). Accommodating (and encouraging) syntax like this is exactly what got the parser into the shape it's in now. Haphazard hacks get ingrained into the code and then suddenly everyone else is expected to be able to deal with them, rather than implementing a proper solution when the issues are obvious. That was largely my point, but it seems to be rather moot now.

A considered decision to change or update ref syntax would be a Good Thing. Implementing "solutions" like this that put a band-aid on the problem and make it more difficult to resolve issues down the line is a Bad Thing.

(In reply to comment #20)

Putting the {{{{{{{{{}}}}}}}}} madness issues aside, the use-case isn't
{{#tag:ref|{{#tag:ref}}}}, it's {{#tag:ref|<ref></ref>}} (see comment 0).
Accommodating (and encouraging) syntax like this is exactly what got the parser
into the shape it's in now. Haphazard hacks get ingrained into the code and
then suddenly everyone else is expected to be able to deal with them, rather
than implementing a proper solution when the issues are obvious. That was
largely my point, but it seems to be rather moot now.

Nested tags have always been allowed, I fixed a related bug as early as r2880. A bug with nested tags does not indicate that there is a problem with #tag, since there are plenty of other ways to do nested tags. If you have some deeper problem with #tag, I suggest you complain on another bug report.

(In reply to comment #21)

(In reply to comment #20)

Putting the {{{{{{{{{}}}}}}}}} madness issues aside, the use-case isn't
{{#tag:ref|{{#tag:ref}}}}, it's {{#tag:ref|<ref></ref>}} (see comment 0).
Accommodating (and encouraging) syntax like this is exactly what got the parser
into the shape it's in now. Haphazard hacks get ingrained into the code and
then suddenly everyone else is expected to be able to deal with them, rather
than implementing a proper solution when the issues are obvious. That was
largely my point, but it seems to be rather moot now.

Nested tags have always been allowed, I fixed a related bug as early as r2880.
A bug with nested tags does not indicate that there is a problem with #tag,
since there are plenty of other ways to do nested tags. If you have some deeper
problem with #tag, I suggest you complain on another bug report.

  • <ref> is implemented.
  • {{#tag:}} is implemented.
  • Users decide to combine the two into new creations.
  • MediaWiki upgrade to 1.18 breaks these (Frankensteinian) creations.
  • Bug is filed documenting exposed strip markers with specific nested markup.
  • [decision whether to properly support nested refs] and/or [decision whether to properly support mixed syntax]

(From comment #20)

A considered decision to change or update ref syntax would be a Good Thing.
Implementing "solutions" like this that put a band-aid on the problem and make
it more difficult to resolve issues down the line is a Bad Thing.

A lack of considered decisions regarding syntax and code structure is what got MediaWiki its current parser. That was my point, buried somewhere in there. Apologies if I hijacked the bug discussion.

(In reply to comment #22)

  • <ref> is implemented.
  • {{#tag:}} is implemented.
  • Users decide to combine the two into new creations.
  • MediaWiki upgrade to 1.18 breaks these (Frankensteinian) creations.
  • Bug is filed documenting exposed strip markers with specific nested markup.
  • [decision whether to properly support nested refs] and/or [decision whether

to properly support mixed syntax]

The point of #tag is to provide access to legacy tag hooks via the newer, more flexible brace syntax. This flexibility allows users to do things which are otherwise difficult, such as nesting ref tags. It's not "Frankensteinian", it's an intended use-case.

As far as I'm concerned, xmlish tags should be deprecated, since backwards compatibility puts constraints on what they can do. Adding even more syntax variations to the grammar for xmlish tags will break compatibility with existing text and make life for WYSIWYG implementors more difficult.

All nested xmlish tags were broken, not just ref tags, so the users who decided to nest ref tags were not at fault for this bug in any way.

We should now focus on Bug 20707, which will probably need parser work. I have made further comments there.

(In reply to comment #22)

  • <ref> is implemented.
  • {{#tag:}} is implemented.
  • Users decide to combine the two into new creations.
  • MediaWiki upgrade to 1.18 breaks these (Frankensteinian) creations.
  • Bug is filed documenting exposed strip markers with specific nested markup.
  • [decision whether to properly support nested refs] and/or [decision whether

to properly support mixed syntax]

Is {{#tag:ref|foo{{#tag:ref|bar}}}} a "Frankensteinian creation" or "mixed syntax" in your mind too? Because it failed in exactly the same way as {{#tag:ref|foo<ref>bar</ref>}}.

goleztrol wrote:

I just upgraded to 1.18 and I got a problem like this when I use a refence within a poem tag ( using the poem extension at http://www.mediawiki.org/wiki/Extension:Poem ).

<poem>"Some text" <ref name="foo"/></poem>

Changing it to

{{#tag:poem|"Some text.{{#tag:ref||name="foo"}}"}}

doesn't change a thing, the result still looks like:

"Some text.UNIQ76f58a4e1b6f37c4-nowiki-00000003-QINU1UNIQ76f58a4e1b6f37c4-nowiki-00000004-QINU"

I'm not nesting any ref tags.

goleztrol wrote:

Nevermind. I altered unstripType to do a recursive replacement.

Then I saw the fix by Tim Starling in Comment 18, which does the same (but in a loop, which is better than recursive). I've overlooked that comment before, and apparently his fix hasn't made it to the stable version yet.. Sorry for the inconvienience.