
Schema request change so deleted edits are identified by revisionID not timestamp (prevents DIFFs from breaking)
Open, Low, Public

Description

Author: FT2.wiki

Description:
At present a revision is usually identified by a revisionID in most functions. Thus a public and visible revision, a suppressed (RevisionDeleted) revision, and an oversighted revision, all are encoded and identified by the revision ID.

Normal deleted revisions are the sole exception - they are identified by a timestamp. This has two problems:

1/ Timestamp (to the precision used to identify a deleted revision: YYYYMMDDHHMMSS) may in some cases not be unique.

2/ Deleted diffs and revisions can't be identified from any link made prior to deletion, since upon deletion they switch from being identified by revision ID to being identified by timestamp. This prevents easy lookup of a diff (eg when a privacy issue or dispute arises), stops admins, checkusers and others from using a diff to check up on a matter if one of the revisions in the diff has been deleted, and seems to prevent diffs working at all, other than in the simple case of comparing two sequential revisions.
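Problem 1/ can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory list of deleted revisions; the tuple layout and the sample rows are invented for illustration and are not MediaWiki's actual schema or data.

```python
from collections import defaultdict

# Hypothetical deleted revisions: (rev_id, page title, timestamp as YYYYMMDDHHMMSS).
deleted = [
    (1001, "Example", "20090315120000"),
    (1002, "Example", "20090315120000"),  # saved within the same second as 1001
    (1003, "Example", "20090315120005"),
]

def index_by(rows, key):
    """Group rows by whatever key(row) returns."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups

by_timestamp = index_by(deleted, lambda r: (r[1], r[2]))  # title + timestamp
by_rev_id = index_by(deleted, lambda r: r[0])             # revision ID

# A key that maps to more than one revision cannot identify a revision uniquely.
ambiguous_ts = [k for k, v in by_timestamp.items() if len(v) > 1]
ambiguous_id = [k for k, v in by_rev_id.items() if len(v) > 1]
```

In this made-up data the title and timestamp pair is ambiguous for the two edits saved in the same second, while the revision ID is not.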

DESIRED CHANGE

1/ All revisions, deleted or visible, to be identified by their revision ID, not timestamp or other identifier. (This may involve a schema change to deleted revisions handling.)

2/ A user who can see the text of both revisions in a diff (eg specified and next, specified and prev, or 2 specified revisions), is always able to view the diff between them; the fact one or both may have been deleted doesn't break this functionality.


See Also:
T2851: when viewing an old version of a page, use old version of templates

Details

Reference
bz18104

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:33 PM
bzimport set Reference to bz18104.
bzimport added a subscriber: Unknown Object (MLST).

FT2.wiki wrote:

Note: This is somewhat related to, but not the same as, bug 18068.

They overlap in that in each case the request involves ability to view deleted diffs, but in this case "traditional deletion" also identifies the diffs in a manner that breaks linkage from diff (revisionID) to deleted diff, and makes it very hard to compare two arbitrary diffs if one or both are "traditionally deleted".

Note that many older revs have a NULL ar_rev_id value

FT2.wiki wrote:

See also bug 21279.

More specifically, any revs in the archive table that were deleted before Wikipedia was upgraded to MediaWiki 1.5 in June 2005, and which were never undeleted after the upgrade, will have a NULL ar_rev_id value.

FT2.wiki wrote:

Presumably that can be fixed by undelete and redelete then? Or a once-off script that fills that field?

Yeah, probably. If this is done, they'll most likely get a high rev ID as if they're a new edit ... about 323,000,000 or so. It would be nice if they could get their old rev_ID's from when they weren't deleted, but I don't think that's possible.

FT2.wiki wrote:

OverlordQ and I took a look at this on the toolserver. Some of this may be obvious or well known - I don't know how much MediaWiki stuff from 2005 would be "common knowledge".

The latest enwiki deleted revision with no rev_id is timestamped 20050627053602 (June 27 2005, 5:36 am), as Aaron and Graham say. 511,728 deleted revisions have no rev_id.

(Around 2356 revisions also have an entry with the same rev_id in both current and deleted revisions tables. This is presumably due to old data slippage.)

Deleted revisions from before June 2005 which were restored apparently got allocated a new rev_id. Eg, compare the dates for enwiki revision id's 15700000 (June 14 2005), 15700001 (June 9 2005), 16300000 (May 1 2005), and 17000001 (Dec 8 2004). It doesn't seem to cause problems though.

There appear to have been about 17,739,500 revisions (roughly 17.74 m) on enwiki prior to the changeover of June 2005. Because rev_ids were reallocated, you have to go back and forth by 50 or 100 at a time to get an idea of what rev_id was reached at roughly what time.

Oversight and Developer deletions would have been negligible up to that point. So in principle, there were approximately 17.7 m revisions prior to the changeover of which 17,043,322 can be traced to "Live data", leaving 696k revision ids untraced.

The conclusion is that the 511 k of old deleted revisions with rev_id = NULL can be sequenced into the 17.7 m known rev_ids prior to the changeover, and there are 696 k rev_ids of deleted revisions which they map into. (The explanation for the other 185 k isn't clear. Delete/restore activity on old revisions??)
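The arithmetic behind the 696 k and 185 k figures can be checked with a short Python sketch. It uses only the counts quoted in this comment; the total of 17,739,500 is itself an estimate, so the results are approximate.

```python
# Counts quoted in the comment above (the total is an estimate).
total_before_changeover = 17_739_500
traced_to_live_data = 17_043_322
null_rev_id_deleted = 511_728

# rev_ids allocated before the changeover that no live revision accounts for.
untraced = total_before_changeover - traced_to_live_data  # 696178, about 696 k

# Untraced rev_ids left over after the NULL-rev_id deleted revisions are mapped in.
unexplained = untraced - null_rev_id_deleted              # 184450, about 185 k

print(untraced, unexplained)
```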

It looks like all the deleted revisions with a null value can be matched fairly accurately by time order against existing gaps in the current revisions and assigned a suitable rev_id that's currently not taken. It might not be perfect but it'll be close, and allocating a time-sequenced old rev_id is probably helpful for admins and the like.

Deletions are quite interspersed with undeleted revisions so this isn't a job requiring human guesswork. It could be a once-off task suited to a script.
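A once-off script of this kind might look roughly like the Python sketch below. It is only an illustration under simplifying assumptions: the function name and data layout are hypothetical, live revisions are assumed to increase in both rev_id and timestamp (which the reallocated rev_ids mentioned above would violate in places), and only a lower bound on the assigned ID is enforced, so the result is approximate.

```python
import bisect

def assign_rev_ids(live, deleted_timestamps, max_rev_id):
    """Assign unused rev_ids to deleted revisions in timestamp order.

    live: (rev_id, timestamp) pairs for revisions that still have an ID,
          assumed to be increasing in both fields.
    deleted_timestamps: timestamps of deleted revisions lacking a rev_id.
    max_rev_id: highest rev_id in use at the changeover.
    Returns (timestamp, rev_id) pairs; rev_id is None if no gap is left.
    """
    live = sorted(live)
    used = {rev_id for rev_id, _ in live}
    free = [i for i in range(1, max_rev_id + 1) if i not in used]  # the gaps
    live_ids = [rev_id for rev_id, _ in live]
    live_ts = [ts for _, ts in live]

    result = []
    pos = 0  # index in `free` of the next unassigned gap
    for ts in sorted(deleted_timestamps):
        # The newest live revision at or before ts bounds the ID from below.
        k = bisect.bisect_right(live_ts, ts)
        lower = live_ids[k - 1] if k else 0
        # Skip gaps that lie at or below that bound.
        pos = max(pos, bisect.bisect_right(free, lower))
        if pos < len(free):
            result.append((ts, free[pos]))
            pos += 1
        else:
            result.append((ts, None))
    return result

# Made-up example: rev_ids 3 and 4 are unused, and two deleted edits fall
# between live revisions 2 and 5 in time.
live = [(1, "20050101000000"), (2, "20050101000100"),
        (5, "20050101000400"), (6, "20050101000500")]
assigned = assign_rev_ids(live, ["20050101000300", "20050101000200"], 6)
```

Here the earlier deleted edit receives rev_id 3 and the later one rev_id 4, so the assigned IDs follow time order within the gap.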

This would at least mean every deleted revision had a rev_id, which is a first step in fixing the problems.

The 185 k leftover revision ID's are probably due to the fact that the deletion archive was cleared twice, in June 2004 and December 2003. It was created in August 2002, so a negligible number of edits were permanently deleted. See:
http://en.wikipedia.org/wiki/Wikipedia:Viewing_and_restoring_deleted_pages#Deletion_archive

The numbers surprise me a bit. They seem to indicate that about three times as many edits were deleted between June 2004 and June 2005 as in the entire period up to June 2004. Perhaps revision ID numbers were reset or reused at some point; it'd be best to ask Brion or Tim about this. The numbers are particularly surprising because those old deleted revisions would presumably include edits to the Wikipedia sandbox, which was routinely moved to bizarre titles by newbies or outright deleted; a page move of a page with many revisions in MW 1.4 and below was a much bigger disaster than it is now, and move protection wasn't introduced until December 2004, see:
http://en.wikipedia.org/wiki/Wikipedia_talk:Protected_page/Archive2#Protection_against_page_moves

I once rescued some old sandbox revisions that go back to June 2004; the ones before then were deleted and are now irretrievable. I used to believe that there were about 50,000 irretrievable sandbox revisions, but with the numbers you have presented here, that may be an exaggeration. For the old sandbox edits, see:
http://en.wikipedia.org/wiki/Wikipedia:Historical_archive/Sandbox

Can I have an example of some revisions having the same ID in the deletion and regular tables? That is very bizarre.

Out-of-order revision IDs cause problems for diffs; the number of intermediate revisions is misreported (as that function works by rev ID), and since the prev/next links in diffs are also ordered by rev ID, they are also affected. See this diff as an example:
http://en.wikipedia.org/w/index.php?title=Talk:Netherlands&diff=11427366&oldid=229588957
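The miscount can be reproduced with a toy page history in Python. This is a sketch only: the history, the revision IDs and both function names are invented, and it assumes the count is taken purely by rev_id range, as described above, rather than reproducing MediaWiki's actual code.

```python
# Hypothetical page history: (rev_id, timestamp). Revision 900 is an old edit
# that was deleted and restored, so its rev_id is out of time order.
history = [
    (100, "20050301000000"),
    (900, "20050310000000"),  # restored: old timestamp, newer rev_id
    (200, "20050320000000"),
    (300, "20050401000000"),
]

def between_by_rev_id(history, old_id, new_id):
    """Count revisions whose rev_id lies strictly between the two IDs."""
    lo, hi = sorted((old_id, new_id))
    return sum(1 for rev_id, _ in history if lo < rev_id < hi)

def between_by_timestamp(history, old_id, new_id):
    """Count revisions whose timestamp lies strictly between the two revisions."""
    ts = dict(history)
    lo, hi = sorted((ts[old_id], ts[new_id]))
    return sum(1 for _, t in history if lo < t < hi)

count_by_id = between_by_rev_id(history, 100, 300)       # sees only revision 200
count_by_time = between_by_timestamp(history, 100, 300)  # sees revisions 900 and 200
```

Counting by rev_id reports one intermediate revision, while counting by timestamp finds two, because the restored revision sits outside the ID range.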

More rarely, revisions have the opposite problem; they have the correct revision ID but the wrong date. See bug 2219.

overlordq wrote:

I've opened bug 22392 for the duplicate revision IDs; I filed it under Wikimedia as I wasn't 100% sure of the product.

ayg wrote:

(In reply to comment #7)

(The explanation for the other 185 k isn't clear. Delete/restore activity on old revisions??)

AUTOINCREMENT columns aren't guaranteed to be allocated sequentially; values can be skipped. In particular, if a transaction inserts a row, then gets rolled back rather than committed, there will be a gap, since the id is assigned at insert time and not at commit time. Autoincrementing is visible immediately, even across transaction boundaries -- which would be a violation of transactional semantics if id's were guaranteed to be sequential, but they aren't.

(This probably doesn't account for *that* many missing revisions, though.)
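The gap-on-rollback behaviour can be modelled with a toy Python sketch. It is not MySQL or InnoDB code; the Table and Transaction classes are hypothetical stand-ins that only mimic the one property described above, namely that an ID is consumed at insert time and is not handed back on rollback.

```python
import itertools

class Table:
    """Toy table whose ID counter is shared and never rewound."""
    def __init__(self):
        self._next_id = itertools.count(1)
        self.rows = {}

    def begin(self):
        return Transaction(self)

class Transaction:
    def __init__(self, table):
        self.table = table
        self.pending = {}

    def insert(self, value):
        row_id = next(self.table._next_id)  # the ID is consumed at insert time
        self.pending[row_id] = value
        return row_id

    def commit(self):
        self.table.rows.update(self.pending)
        self.pending = {}

    def rollback(self):
        self.pending = {}  # rows are discarded, but their IDs are not reused

table = Table()

t1 = table.begin()
t1.insert("a")
t1.commit()

t2 = table.begin()
t2.insert("b")
t2.rollback()  # ID 2 was handed out and is now lost

t3 = table.begin()
t3.insert("c")
t3.commit()

committed_ids = sorted(table.rows)  # [1, 3]
```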

*** Bug 23695 has been marked as a duplicate of this bug. ***

FT2.wiki wrote:

As an interim option, can we at least have "&oldid=" work with Special:Undelete?

Most deleted revisions have a revision id and the field is indexed. The few old deleted revisions that don't have a revision id can easily be given one.

Revid/oldid is universally used everywhere else to identify a revision, except when it comes to deleted revisions. It would be a fairly simple change to have Mediawiki correctly handle links of the form:
http://en.wikipedia.org/w/index.php?title=Special:Undelete&oldid=12345

as an equally valid alternative to the existing:
http://en.wikipedia.org/w/index.php?title=Special:Undelete&target=PAGENAME&timestamp=TIMESTAMP

Advantages:

1/ Whether a revision is deleted or undeleted, the revision ID stays the same, so it can still be used by admins to find the revision;

2/ Allows a fallback to be added to Mediawiki so that if &oldid= is given in a link, and the revision isn't in the current revisions table, it can easily and automatically be searched for in the deleted revisions table instead (and vice versa), so diff links will "dead end" less often.

3/ It's simple to do and probably trivial on effort (if Special:Undelete args include a page+timestamp look up the data by those, if the args are an oldid then look up the data by that in the same table);

4/ Makes it easier to transition in future to using rev_id as the common index field for revisions whether deleted or visible, which is a simplifying direction for Mediawiki. Pagename/timestamp would continue to work so nothing "breaks", but allowing oldid to work will make a future transition easier.
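The lookup described in points 2/ and 3/ might be sketched as follows in Python, using an in-memory SQLite database. This is an illustration under assumptions, not MediaWiki code: the tables are heavily simplified, rev_title is a hypothetical column (the real revision table links to a page table instead), and the function name and sample rows are invented.

```python
import sqlite3

def make_db():
    """Build a tiny stand-in for the revision and archive tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_title TEXT, rev_timestamp TEXT);
        CREATE TABLE archive  (ar_rev_id INTEGER, ar_title TEXT, ar_timestamp TEXT);
        INSERT INTO revision VALUES (10, 'Foo', '20100101000000');
        INSERT INTO archive  VALUES (11, 'Foo', '20100102000000');
        INSERT INTO archive  VALUES (NULL, 'Bar', '20050101000000');
    """)
    return db

def find_revision(db, oldid=None, title=None, timestamp=None):
    """Return (state, title, timestamp), or None if nothing matches."""
    if oldid is not None:
        # Try the live table first, then fall back to the deleted one.
        row = db.execute(
            "SELECT rev_title, rev_timestamp FROM revision WHERE rev_id = ?",
            (oldid,),
        ).fetchone()
        if row:
            return ("live",) + row
        row = db.execute(
            "SELECT ar_title, ar_timestamp FROM archive WHERE ar_rev_id = ?",
            (oldid,),
        ).fetchone()
        return ("deleted",) + row if row else None
    # No oldid given: keep the existing page + timestamp lookup.
    row = db.execute(
        "SELECT ar_title, ar_timestamp FROM archive"
        " WHERE ar_title = ? AND ar_timestamp = ?",
        (title, timestamp),
    ).fetchone()
    return ("deleted",) + row if row else None

db = make_db()
live_hit = find_revision(db, oldid=10)
deleted_hit = find_revision(db, oldid=11)
legacy_hit = find_revision(db, title="Bar", timestamp="20050101000000")
```

The page and timestamp path is kept so that old links, and old rows with no ar_rev_id, continue to resolve.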

MER-C subscribed.

Proposing for the 2017 Developer Wishlist (schema change only).

It looks from T25695 as though this task might have gained some useful improvements. @aaron @jayvdb are either of you aware of the current status of this task?

That task was closed as a duplicate of this one.

An ar_id field was added in 9b2b027ba7bc922bb9150cad19c49feb0c892f9f (per T41675). It sounds like that resolves this bug (and that one too; not sure why it's open…).

Tgr subscribed.

DESIRED CHANGE

1/ All revisions, deleted or visible, to be identified by their revision ID, not timestamp or other identifier. (This may involve a schema change to deleted revisions handling.)

2/ A user who can see the text of both revisions in a diff (eg specified and next, specified and prev, or 2 specified revisions), is always able to view the diff between them; the fact one or both may have been deleted doesn't break this functionality.

The first was fixed in T2603. The second is not in scope for the developer wishlist, hence removing tag.