Make limited information from filearchive available to everyone
Closed, Resolved (Public)

Description

Original bug title:
Make limited information from API:Filearchive available to everyone

Reasoning:
When it comes to identifying copyright violations and [[WP:Sock puppetry]], it is very helpful to be able to check whether a file has been uploaded before without uploading the file into the stash yourself.

Demand:
title and size, filterable by sha1
( fasha1=HEXHASH&faprop=title|size )
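For illustration, the requested query could be assembled roughly as follows. This is only a sketch: the endpoint is the Commons API URL used in the example at the end of this task, `format=json` is an addition, `build_filearchive_query` is a hypothetical helper name, and the `faprop` values simply mirror the request above (they are not checked against what the API accepts).

```python
from urllib.parse import urlencode

# Assumption: the Commons endpoint, as used in the example query later in this task.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_filearchive_query(sha1_hex, props=("title", "size")):
    """Build a list=filearchive query URL filtered by a hex SHA-1 hash.

    The default props mirror the request in this task; "format" is an
    addition so that the result comes back as JSON.
    """
    params = {
        "action": "query",
        "list": "filearchive",
        "fasha1": sha1_hex,
        "faprop": "|".join(props),
        "format": "json",
    }
    # urlencode percent-encodes the "|" separator as %7C.
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_filearchive_query("7f08c97431182ef389ef6f09faac9ff6410b5674"))
```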

What about privacy?
Not an issue. If you upload a file to the stash, you are able to obtain this information anyway.


Version: 1.23.0
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=57697

Details

Reference
bz58993

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 2:28 AM)
bzimport set Reference to bz58993.
bzimport added a subscriber: Unknown Object (MLST).

Dumb question: what's the use case for this? (Ideally I'd also like to understand the use case for the existing functionality as well, but one thing at a time...)

(In reply to comment #1)

Dumb question: what's the use case for this?

See Reasoning. Let me give you 3 examples:

User uploads copyright violation. Patroller marks file for deletion. Admin deletes file. User uploads same file again. Patroller can now use the SHA-1 lookup at https://commons.wikimedia.org/w/index.php?title=Commons:User_scripts/File_Analyzer&withJS=MediaWiki:FileAnalyzer.js
to check whether an identical file existed before and identify the user(s) who uploaded that file.

Bot coder and bot are not administrators. Bot uploads a batch of very large files. But some were previously deleted and should not be uploaded again. Bot could check the SHA-1 before uploading to save bandwidth.

File is marked for transfer from en.wikipedia to Commons. Bot/Tool could check whether this file was previously deleted at Commons and refuse the transfer.
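For the bot and transfer cases, the local side of the check is just computing the file's SHA-1 as a hex string, which is the form the `fasha1` parameter in this request takes. A minimal sketch using only the Python standard library (`file_sha1_hex` is a hypothetical helper name, and the 1 MiB chunk size is an arbitrary choice):

```python
import hashlib


def file_sha1_hex(path, chunk_size=1 << 20):
    """Return the SHA-1 of a file as a lowercase hex string.

    The file is read in chunks so that very large files do not have
    to fit into memory at once.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hash could then be sent to the API before any upload is attempted, so that only the hash, not the file, travels over the network.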

...
Please let me know if this was convincing enough or whether you would like to get more feedback from Commons users. Or are you asking for a technical explanation of SHA1 and that kind of stuff? Sorry, here at bugzilla, it's always a bit difficult to get it right because I never know to whom I am talking without googling.

(In reply to comment #2)

I never know to whom I am talking without googling.

I've bookmarked https://wikimediafoundation.org/wiki/Staff?showall=1 for that :)

Examples were perfect, thanks - understand the use case much better now.

I'm fine with this from a privacy perspective, as long as it respects suppression of titles (which should also be respected if you do a full file upload - I understand that isn't currently the case, have filed bug 59167 for that.)

[Also, I've tweaked my settings to say a little bit about who I am, hope that helps (though I suppose that might make you *more* likely to explain SHA1, which I definitely don't need!) ]

From a (non-sysop) bot writing perspective, it would be great to be able to get an array of previous deletions for a queried SHA-1. At the moment pywikipediabot passes back the name of a matching file, but not all matches.

I suggest that the deleted file names are passed back (incredibly useful info when these contain reference numbers from the original source, such as Flickr photo ids) *unless* there were a reason to suppress the filename from the deletion log. Other basic information (dates, uploader, editors) would be great for a bot to take action on, or make decisions about. Scenarios include a bot taking different actions based on whether it sees its name as a past uploader or whether upload dates fall within the dates of a recent batch upload project.

There may be privacy issues on some data elements (such as listing all past editors or uploaders), however I think we should expect to be able to automatically distinguish between ordinary deleted material (such as copyvios) and files which were deleted due to respect/privacy concerns.

Not related to 58791, removing dependency.

Simply claiming this task to get some kind of todo list; if someone else beats me, please just take this task!

This would be massively useful to OTRS agents who are non-admins on Commons, including myself. Any progress?

This has long been available on the replicas; see e.g. T71088: Queries of commonswiki_p.filearchive for fa_sha1 are slow. What's the blocker here?

Uh, never mind, this task is about the API.

Simply claiming this task to get some kind of todo list; if someone else beats me, please just take this task!

Working on it.

Change 530775 had a related patch set uploaded (by Don-vip; owner: Don-vip):
[mediawiki/core@master] T60993 - Make limited information from filearchive available to everyone

https://gerrit.wikimedia.org/r/530775

It works \o/ This is a query with all information we can get without having any special user right:

https://commons.wikimedia.org/w/api.php?action=query&list=filearchive&fasha1=7f08c97431182ef389ef6f09faac9ff6410b5674&faprop=sha1|timestamp|user|size|dimensions|mime|mediatype|bitdepth|archivename

{
    "batchcomplete": "",
    "query": {
        "filearchive": [
            {
                "id": 5733948,
                "name": "Rauf_&_Faik_young_2.jpg",
                "ns": 6,
                "title": "File:Rauf & Faik young 2.jpg",
                "userid": 7962629,
                "user": "Saulishki",
                "sha1": "7f08c97431182ef389ef6f09faac9ff6410b5674",
                "timestamp": "2019-07-28T21:17:25Z",
                "size": "600213",
                "height": "833",
                "width": "1080",
                "mediatype": "BITMAP",
                "bitdepth": "8",
                "mime": "image/jpeg"
            }
        ]
    }
}
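A bot could reduce a response of this shape to the fields it needs along these lines. This is a sketch only: it assumes the JSON layout shown above (including `size` arriving as a string), and `summarize_filearchive` is a hypothetical helper name.

```python
import json


def summarize_filearchive(response_text):
    """Extract title, uploader, timestamp and size from a filearchive response.

    Assumes the layout shown in this task: query -> filearchive -> list of
    entries. Missing keys yield None instead of raising.
    """
    data = json.loads(response_text)
    entries = data.get("query", {}).get("filearchive", [])
    summary = []
    for entry in entries:
        size = entry.get("size")
        summary.append({
            "title": entry.get("title"),
            "user": entry.get("user"),
            "timestamp": entry.get("timestamp"),
            # The sample response carries size as a string, so convert it.
            "size": int(size) if size is not None else None,
        })
    return summary
```

An empty list means no deleted file with that SHA-1 was reported; a non-empty list gives every match, not just the first one.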