
CAPTCHA required to edit any page on testwiki containing a link with no path
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

A CAPTCHA is triggered.

What should have happened instead?:

No CAPTCHA is triggered. For example, this edit on enwiki did not require a CAPTCHA.

The same problem also occurs if:

This also affects mediawikiwiki, but not test2wiki.

Event Timeline


Seems to affect more than just ConfirmEdit. Check out this abuse log hit:

  • added_links is https://spam.org
  • removed_links is https://spam.org/

Note the trailing slash, which the edit did NOT remove.
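The phantom added_links/removed_links pair above can be reproduced with a naive set difference. This is an illustrative sketch (not MediaWiki code), assuming the "before" list is reconstructed from the normalized index columns while the "after" list comes from the parsed wikitext:

```python
# Toy reproduction: when one side of the comparison normalizes trailing
# slashes and the other does not, a plain set difference reports phantom
# added and removed links even though the edit changed nothing.

def diff_links(before, after):
    """Return (added, removed) as naive set differences."""
    before, after = set(before), set(after)
    return after - before, before - after

# The page contains https://spam.org and the edit does not touch it,
# but one side carries a trailing slash and the other does not:
added, removed = diff_links(
    before=["https://spam.org/"],  # reconstructed from index columns
    after=["https://spam.org"],    # taken from the edited wikitext
)
# "added" now contains the URL without the slash, and "removed" the one
# with it, matching the abuse log hit above.
```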

Could this have something to do with T326251 and e06f77134defc?

		if ( $extlinkStage & SCHEMA_COMPAT_READ_OLD ) {
			return $dbr->newSelectQueryBuilder()
				->select( 'el_to' )
				->distinct()
				->from( 'externallinks' )
				->where( [ 'el_from' => $pagId ] )
				->caller( $fname )->fetchFieldValues();
		} else {
			$links = [];
			$res = $dbr->newSelectQueryBuilder()
				->select( [ 'el_to_domain_index', 'el_to_path' ] )
				->from( 'externallinks' )
				->where( [ 'el_from' => $pagId ] )
				->caller( $fname )->fetchResultSet();
			foreach ( $res as $row ) {
				$links[] = LinkFilter::reverseIndexe( $row->el_to_domain_index ) . $row->el_to_path;
			}
			return $links;
		}

Which wikis are SCHEMA_COMPAT_READ_OLD, and what is el_to_path if the path is empty?

Oh here we go:

// T312666
'wgExternalLinksSchemaMigrationStage' => [
        'default' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD,
        'testwiki' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW,
        'mediawikiwiki' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW,
        'fawikiquote' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW,
],

I DID get a captcha for this edit on mediawikiwiki.
I DID NOT get a captcha for this edit on test2wiki.

suffusion_of_yellow renamed this task from CAPTCHA required to edit any page containing a link on testwiki to CAPTCHA required to edit any page on testwiki containing a link with no path.May 20 2023, 10:11 PM
suffusion_of_yellow updated the task description. (Show Details)

LinkFilter::makeIndexes() has:

$index2 = $bits['path'] ?? '/';

But that makes it impossible to know if the path was "/" or empty, and the original URL can't be recovered, except from el_to which I gather from T312666 you plan on getting rid of.

And if you can't recover the full URL, you're going to get a different output from the "before" and "after" sets of links when you try to find out what the user added. So either

  • el_to_path needs to be an empty string, iff the original URL had no path (the old DB entries can be batch updated from el_to), or
  • Every extension that checks for added links (SpamBlacklist, AbuseFilter, ConfirmEdit, and who knows what else) needs to append a "/" to any URL with no path, in both the "before" and "after" links

And it gets worse. This also happens with:

So it looks as if a whole lot of information is being lost, and it can't be recovered from the database using el_to_domain_index and el_to_path alone.

Umherirrender subscribed.

There is some ongoing work on external links. Maybe the CAPTCHA code does not get the right list of old links from the database, so every comparison with the new page content produces a list of "new" links, which triggers a CAPTCHA. It is not related to AbuseFilter or SpamBlacklist. AbuseFilter could be affected as well, since it also allows filtering on external links, but I would not assume there is a filter on testwiki for that.

So https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/914280 is the reason

Yes, AbuseFilter is affected by this, see the incorrect added_links and removed_links in my comment above. Any extension that uses getExternalLinksForPage will need to be updated, if the plan really is to do away with el_to.

I can confirm that SpamBlacklist is affected by this too. This blacklist hit was triggered by attempting to add https://example.com to a page already containing https://spam.site. Not sure why the tags were removed.

SpamBlacklist does not fetch the links with its own code; it uses core's ExternalLinksLookup::getExternalLinksForPage, so there is nothing to fix in SpamBlacklist itself. AbuseFilter uses the same code path.

On testwiki (with the new schema), ExternalLinksLookup::getExternalLinksForPage uses the database fields with "index" in their names. The index columns are populated even when the path is empty, to make searches on the special page easier: a "/" is stored even if it is not present on the page. There is no way to distinguish whether the URL was written with or without the trailing "/" in the wikitext.

There are at least two ways to fix this: make the new links carry the empty path in ParserOutput::getExternalLinks (that is in core), or make the comparison between old and new links aware of the empty path so both sides stay the same (that would mean changes in both extensions). Not sure which would be best.

The fix for this is rather simple: currently, the code takes the existing external URLs from the table and compares them with the links added in the edit, and due to the new way of indexing, it thinks those links were just added. The solution is to run the second set through the indexing as well and then compare; that way, an unchanged spam link is no longer seen as "added" by the edit.

Change 923294 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] ExternalLinks: Add support for non-reveresed indexed URLs

https://gerrit.wikimedia.org/r/923294

Change 923295 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/ConfirmEdit@master] Improve support for read-new wikis with externallinks

https://gerrit.wikimedia.org/r/923295

Ladsgroup added a project: DBA.
Ladsgroup moved this task from Triage to In progress on the DBA board.

Change 923294 merged by jenkins-bot:

[mediawiki/core@master] ExternalLinks: Add support for non-reveresed indexed URLs

https://gerrit.wikimedia.org/r/923294

Change 923610 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/AbuseFilter@master] Improve support for read-new wikis with externallinks

https://gerrit.wikimedia.org/r/923610

Change 923613 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/SpamBlacklist@master] Improve support for read-new wikis with externallinks

https://gerrit.wikimedia.org/r/923613

Change 923295 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] Improve support for read-new wikis with externallinks

https://gerrit.wikimedia.org/r/923295

Change 923613 merged by jenkins-bot:

[mediawiki/extensions/SpamBlacklist@master] Improve support for read-new wikis with externallinks

https://gerrit.wikimedia.org/r/923613

Change 923610 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Improve support for read-new wikis with externallinks

https://gerrit.wikimedia.org/r/923610

@Ladsgroup: Thanks. Tested this again, and most of the listed problems have been fixed. But I still trip AbuseFilter, SpamBlacklist and ConfirmEdit when I edit a page containing a link with a port number, see https://test.wikipedia.org/wiki/Special:AbuseLog/98826. I also discovered a new problem with IPv6 links; see https://test.wikipedia.org/wiki/Special:AbuseLog/98822.

Okay, good. I don't think this is a blocker anymore, but I will see what I can do to make this work. It really shouldn't fail, as the before and after sets have both gone through "indexifying", so they should not differ. Anyway, I'll check it later this week or next week.

FWIW IPv6 links are fairly common on enwiki, as they're built into various "user info" templates, see https://en.wikipedia.org/w/index.php?title=Wikipedia:Administrator_intervention_against_vandalism/TB2&oldid=1157765051 for example. I have no idea how common port numbers are.

When does test2wiki update to wmf11? I want to be sure these changes don't cause anything unexpected at READ_OLD wikis, or we'll be dealing with issues at enwiki on Thursday.

test2wiki will be on wmf.11 tomorrow, and the function short-circuits on read old. I'll check why it gets the reversed one for mailto:, but that should be easy to fix.

@Legoktm I don't think it's a train blocker. Can you elaborate why you made it such?

Generally issues that affect anti-abuse stuff are train blockers and there seemed to be unfixed issues based on the last comments about IPv6 and mailto links? But if I misunderstood the severity, please feel free to revert me.

That's on read-new wikis, which is test wikis only. SOY wants to test them to see whether it affects read-old too, which is just a double check.

Change 924964 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] ExternalLinks: Fix mailto: handling in read new

https://gerrit.wikimedia.org/r/924964

Fixed mailto:. That was rather easy. IPv6 and IPv4 are not that easy to fix, though. On it.

Change 924989 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] ExternalLinks: Make IP links work with read new

https://gerrit.wikimedia.org/r/924989

So the IP is also fixed now. Now I need someone to review and merge it.

Change 924964 merged by jenkins-bot:

[mediawiki/core@master] ExternalLinks: Fix mailto: handling in read new

https://gerrit.wikimedia.org/r/924964

Change 924989 merged by jenkins-bot:

[mediawiki/core@master] ExternalLinks: Make IP links work with read new

https://gerrit.wikimedia.org/r/924989

Ladsgroup moved this task from In progress to Done on the DBA board.

I'll start setting more wikis to read new

I did a more thorough check at testwiki, including all supported URI schemes. There is still a problem with port numbers, see https://test.wikipedia.org/wiki/Special:AbuseLog/98932.

I don't know how common pages with port numbers are, but this search finds over 11000 links before it times out.

Change 928640 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Externallinks: Make port part of the index

https://gerrit.wikimedia.org/r/928640

Change 928640 merged by jenkins-bot:

[mediawiki/core@master] Externallinks: Make port part of the index

https://gerrit.wikimedia.org/r/928640

Change 928608 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.41.0-wmf.12] Externallinks: Make port part of the index

https://gerrit.wikimedia.org/r/928608

Change 928608 merged by jenkins-bot:

[mediawiki/core@wmf/1.41.0-wmf.12] Externallinks: Make port part of the index

https://gerrit.wikimedia.org/r/928608

Mentioned in SAL (#wikimedia-operations) [2023-06-08T20:20:50Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:928608|Externallinks: Make port part of the index (T337149)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-08T20:22:27Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:928608|Externallinks: Make port part of the index (T337149)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-08T20:31:01Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:928608|Externallinks: Make port part of the index (T337149)]] (duration: 10m 10s)

Still something weird going on with protocol-relative (?) links, see https://test.wikipedia.org/wiki/Special:AbuseLog/99197 where a dummy edit apparently was seen as adding and removing about 30 links.

It was due to stale data; removing everything and putting it back fixed it. Use of protocol-relative URLs is discouraged, and they should be removed.

Thanks. Will this be a problem for a while on enwiki while the stale data is flushed out? Yes, it would be nice to do away with all those links eventually but they're everywhere...

It will be flushed out soon after a reparse once the wiki goes write-new only. It'll take a bit, but we will get there soon.

It seems there are 15.5M such links on enwiki (31M rows in the DB), but hopefully a lot of them come from templates (I fixed some a long time ago: https://en.wikipedia.org/w/index.php?title=Module:Citation/CS1/Configuration&diff=prev&oldid=1138454880). Let me query this on the stats machine:

mysql:[email protected] [enwiki]> select el_to_domain_index, count(*) from externallinks where el_to like '//%' group by el_to_domain_index order by count(*) desc limit 500;

Once done, you'll get the most common domains, probably some will show up in highly used templates and can be cleaned up (and it'll free up a lot of space)

P49464

Fixing geohack removes 2.5M from that. I'll go through the list now.

I'm working on removing some of these. The {{SERVER}} magic word currently outputs //en.wikipedia.org (see also {{SERVERNAME}}, which outputs en.wikipedia.org) [ Magic words ]. I am not sure of the difference between the two variables, or whether that indicates that {{SERVER}} should not be protocol-relative. It will make an additional search necessary to find things.


And {{fullurl:fullpagename |query}} outputs a protocol-relative link, according to the same page. That could be replaced 1-for-1 by {{canonicalurl:fullpagename |query}}. It might be better either to deprecate the one version or to make it match canonicalurl.

This comment was removed by Snaevar.

For now we can use {{SERVERNAME}} until we properly deprecate protocol-relative URLs in magic words.