Wikipedia:Bots/Requests for approval/DeadLinkBOT
Automatic or Manually Assisted: (Mostly) Automatic, supervised
Programming Language(s): Perl
Function Summary: To correct dead links due to link rot
Edit period(s) (e.g. Continuous, daily, one time run): as needed
Already has a bot flag (Y/N): N
Function Details: DeadLinkBOT's purpose is to update links that have become invalid due to link rot. The first version of the program will simply replace all instances of a user-supplied link with an updated link (i.e. a change pre-approved by me). When needed, the program can make simple determinations about the nature of the WP page in order to pick a new link from a list of alternatives (given user-supplied rules). In the future, the program will be expanded to actively seek out updated links after retrieving a list of dead links to be updated. These more advanced changes will require user confirmation. When a page is edited to update a link, the bot will also apply AWB-like general fixes.
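A minimal sketch of that substitution step, assuming the page wikitext has already been fetched; the URLs and the content-based rules are placeholders, not the bot's actual configuration:

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Sketch only: replace a pre-approved dead URL, choosing the new target from
# simple, pre-approved rules about the page content. URLs and rules below are
# placeholders for illustration.
use strict;
use warnings;

my $dead_url = 'http://www.example.org/old/page.htm';

# Ordered rules: the first pattern that matches the page text picks the new URL.
my @rules = (
    [ qr/peerage/i    => 'http://www.example.org/new/peerage.htm'    ],
    [ qr/baronetage/i => 'http://www.example.org/new/baronetage.htm' ],
    [ qr/./           => 'http://www.example.org/new/index.htm'      ],  # fallback
);

sub replace_dead_link {
    my ($wikitext) = @_;
    for my $rule (@rules) {
        my ($pattern, $new_url) = @$rule;
        next unless $wikitext =~ $pattern;
        $wikitext =~ s/\Q$dead_url\E/$new_url/g;   # literal match, every occurrence
        last;
    }
    return $wikitext;
}
</syntaxhighlight>

Keeping the rules as an ordered list makes the fallback explicit: the first matching rule wins, and a page matching none of them is left untouched.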
Discussion
- What about websites that go through regular downtime? If the bot reads them as dead while they are temporarily down, it will remove a good link. Xclamation point 05:01, 2 December 2008 (UTC)
- It will be attempting to fix 404 links found at Wikipedia:Linkrot. In theory, 404 errors are not due simply to downtime, but rather to a page being renamed or moved. Per WP policy, the bot won't remove any link for which it can't find an alternative; i.e., it will only address links that have moved to a new location. These precautions should prevent the removal of any temporarily unavailable links.
- The first version of the program will only change links specified in advance, starting with the 2200+ links here [1]. When I add the automatic link-finding feature, the bot will double-check proposed changes with me before making them --ThaddeusB (talk) 15:46, 2 December 2008 (UTC)
- Can you put in a double check of links, say, a week apart, to ensure that it's not a short-lived 404 that caused the problem -- Tawker (talk) 07:17, 6 December 2008 (UTC)
- Yes, I will add that feature. --ThaddeusB (talk) 22:00, 7 December 2008 (UTC)
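A sketch of that double check, using LWP::UserAgent for the HTTP side; how the date of the first failed check is stored is not specified in this request, so it is passed in here as a plain epoch value:

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Sketch only: treat a URL as dead only if it returns 404 on two checks made
# at least a week apart. The caller supplies the epoch time of the first 404.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 30, agent => 'DeadLinkBOT-check/0.1 (sketch)' );

sub is_404 {
    my ($url) = @_;
    my $res = $ua->head($url);
    $res = $ua->get($url) if $res->code == 405;   # some servers refuse HEAD
    return $res->code == 404;
}

sub confirmed_dead {
    my ( $url, $first_seen ) = @_;                # epoch seconds of the first 404
    return 0 unless is_404($url);                 # not even failing right now
    return 0 if time() - $first_seen < 7 * 24 * 3600;   # less than a week apart
    return 1;                                     # dead on both checks
}
</syntaxhighlight>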
- Is the source code to your bot available? — Carl (CBM · talk) 14:15, 6 December 2008 (UTC)
- I wasn't planning on releasing it for public consumption. --ThaddeusB (talk) 22:00, 7 December 2008 (UTC)
- Why not? If it's actually going to be changing links in articles, I'd really like to know that the code is sound. Mr.Z-man 21:57, 12 December 2008 (UTC)
- Well, considering it's explicitly not required, I shouldn't have to justify my decision. But since you asked: my code is undocumented and "ugly" - it is not intended to be read by anyone but me. I really don't see what the issue is - all the program does, as far as Wikipedia goes, is replace a pre-screened dead URL with a pre-screened good one, possibly applying pre-screened regexes to pick between two or more different options. Nonetheless, I put the code up anyway: User:DeadLinkBOT/source --ThaddeusB (talk) 23:45, 12 December 2008 (UTC)
I have tested the bot with local writes and all works according to plan. I would appreciate it if a trial could be approved for actual wiki editing soon. Thanks. --ThaddeusB (talk) 02:34, 14 December 2008 (UTC)
- This all seems rather sketchy to me. "I'm gonna go through all the articles and change a bunch of links. And... um... apply 'AWB-like' general fixes too." People applying AWB-like changes usually get banned pretty quickly because they rarely consider the large, large number of corner cases. And general fixes require someone to watch them and verify each edit to avoid things getting screwed up. As for the link changes, do you have any examples from articles? Edits you've done by hand (or even using this script)? And will you only be dealing with pages in namespace 0? --MZMcBride (talk) 04:41, 15 December 2008 (UTC)
- First of all, I don't appreciate the attitude. I said nothing like "I'm going to go through all articles and change a bunch of links." What I actually said is that I was going to change all specific instances of a known bad link to a known good link (using Special:LinkSearch). I also said several times that every change would be pre-approved by me. If there was a problem with doing general fixes, then why was that never mentioned before now? This request is now two weeks old and this is the first I'm hearing of it potentially being a problem. I am certainly willing to drop that part of the request and (potentially) resubmit it as a separate request with a specific list of fixes.
- I also stated the list of links I'd be starting with above. This comes from a specific request at Wikipedia:AutoWikiBrowser/Tasks#LeighRayment.com_.28continued.29. There are 2500+ of them. I have tested the first batch of them with local writes, but it's against WP policy to have a bot edit WP without trial approval, so of course I haven't actually written them to WP. Isn't that the whole point of having a trial period?
- Since it's correcting DEAD links, I don't see any reason to limit its scope (although it does avoid editing archives), but I could easily change that if desired. --ThaddeusB (talk) 05:13, 15 December 2008 (UTC)
- So will this bot only be working on angeltowns.com links, or is this request for broader approval? If it's the former, this can probably be speedily approved. For the latter, it's going to require more time / testing / whatever. As to why no one mentioned that AWB's general fixes are problematic, well, probably because most of BAG is either inactive or incompetent. /me shrugs. Though I do think AWB's documentation is pretty explicit about the 'danger' of general fixes. --MZMcBride (talk) 06:28, 15 December 2008 (UTC)
- I wrote the bot in order to correct the angeltowns links, but I don't see any reason to limit its scope. I have written dozens of text-parsing scripts in the past and am well aware of the potential issues involved with unexpected input and such. I do realize AWB-style general fixes are difficult to implement correctly, but I feel I am up to the challenge. Nonetheless, I will drop that part of the request at this time. (I have seen bots approved for general fixes in the past and didn't think it would be an issue, or I would never have added that part.) URL replacements, however, do not present such issues. There just isn't any realistic chance of a search for "http://www.somesite.com/directory/somedeadURL.htm" generating false positives outside of a few specific pages, such as WP's lists of dead links. If my bot works correctly on somewebsite.com, it will work on someotherwebsite.com, assuming the input (supplied by me) is valid.
- I am well aware that ultimately I am responsible for every edit the bot makes, and I will use the utmost care in what I tell it to fix. If I maliciously told it to change every www.microsoft.com to www.myspamsite.com then obviously I'd be in trouble. But if that were my intention, why would I even bother trying to get approval? --ThaddeusB (talk) 12:37, 15 December 2008 (UTC)
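To make those safeguards concrete, here is a sketch of a skip check the processing pass could run before touching a page; the page-name patterns and skip rules are assumptions for illustration, not settled bot policy:

<syntaxhighlight lang="perl">
# Sketch only: decide whether a page should be left alone even though it
# contains the dead URL. Titles and rules here are illustrative assumptions.
use strict;
use warnings;

sub should_skip_page {
    my ( $title, $wikitext, $old_url, $new_url ) = @_;
    # Never touch archives or project pages that deliberately list dead links.
    return 1 if $title =~ m{/Archive}i;
    return 1 if $title =~ m{^Wikipedia:.*(?:Dead|Link rot)}i;
    # If both the old and the new URL already appear, leave it for manual review.
    return 1 if index( $wikitext, $old_url ) >= 0
             && index( $wikitext, $new_url ) >= 0;
    return 0;
}
</syntaxhighlight>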
- I think this is wonderful. If it fixes angeltowns.com/town that is a great test in itself. Let's go guys. Kittybrewster ☎ 09:56, 15 December 2008 (UTC)
Can a member of BAG please explain exactly what they want me to do to prove this bot works correctly? I've tested it locally, answered every question here, released the source, and tried to be patient but no one seems to be willing to act. What do I need to do to get the ball rolling? --ThaddeusB (talk) 02:34, 18 December 2008 (UTC)
I would like to see the bot split into two parts: a read-only bot that identifies items that need replacing, and a change-bot that works off the generated list, with the caution that the change-bot would only make a change if the text to be changed and its immediate surroundings hadn't been edited in the meantime. There are at least three good reasons for this:
- It greatly reduces the risk of harm.
- It can be used for "identification" situations to quickly identify all occurrences of a particular URL or URL fragment for other uses, such as looking for patterns of spamming, etc. I'm not familiar with semi-automated editing tools, but in principle the generated list can be used as input to a semi-automated tool, leaving it to a human being to confirm or cancel each edit. This would be practical on only relatively short lists, maybe a few hundred or so.
- The read-only portion by definition does not need approval from the BAG; it can run as soon as it's written (a sketch of such a pass follows below).
Once the two bots are working nicely separately, they can be interleaved, so that as an item is added to the list, it is immediately processed and the edit is made. davidwr/(talk)/(contribs)/(e-mail) 19:50, 18 December 2008 (UTC)
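A rough sketch of what such a read-only identification pass could look like, using list=exturlusage, the API counterpart of Special:LinkSearch; the search string and the tab-separated hand-off format are assumptions, and API continuation is omitted for brevity:

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Sketch only: read-only pass listing pages that contain a given external URL.
# A real run would also follow the API's continuation parameters so that more
# than one batch of results is collected.
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON;

my $ua  = LWP::UserAgent->new( agent => 'DeadLinkBOT-scan/0.1 (sketch)' );
my $api = 'https://en.wikipedia.org/w/api.php';

my $url_query = 'www.example.org/old';        # placeholder search string

my $uri = URI->new($api);
$uri->query_form(
    action  => 'query',
    list    => 'exturlusage',
    euquery => $url_query,
    eulimit => 500,
    format  => 'json',
);

my $res = $ua->get($uri);
die $res->status_line unless $res->is_success;

my $data = decode_json( $res->decoded_content );
for my $hit ( @{ $data->{query}{exturlusage} || [] } ) {
    print join( "\t", $hit->{title}, $hit->{pageid} ), "\n";   # one page per line
}
</syntaxhighlight>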
There is already a bot request for which this bot would be useful: Wikipedia:Bot_requests#Bulk-replace_URL_for_Handbook_of_Texas_Online davidwr/(talk)/(contribs)/(e-mail) 19:50, 18 December 2008 (UTC)
- Hello, the way the bot is currently structured is as follows:
Find Links (this part is not yet written, but like you say doesn't actually require approval since it doesn't do any wiki writing)
- Gets a dead URL from a Wikipedia:Link_rot sub page
- Finds the last good version of said URL using archive.org or search engine cache
- Make sure it wasn't an ad page set up by a domain squatter; if so, find an older version
- See if the last good version mentions a site move; if so, check the destination URL to make sure it is live and that the content matches
- If no move URL is found, perform a search-engine search using blocks of text from the last good page to try to find where it moved
- Write recommended changes to file for review
Wait a week to ensure the URL is indeed dead
Alternatively, if a user (such as yourself) supplies a URL that needs changing, the URL can go directly into the "for review" stack
The URL is reviewed by me to make sure the recommended change is accurate; it is then moved into a machine-readable file to be processed
Processing
- Get the URL + change(s) from the file; a change can require a simple test, such as making sure "text" is in the page to be changed, and decisions are made based on those tests; the new text can be anything - presumably a URL or a template.
- Use Special:LinkSearch to find all instances of the URL on Wikipedia - this list could be output to a file if you want
- Make the changes using Perl's s/// operator; the scope can be limited if desired (to article space only, for example; archives are always excluded). Alternatively, I could add a simple check to make sure the old and new URLs don't both appear on the same page - that should remove any false positives - and write those cases to a file for manual review. (A rough sketch of this processing pass follows below.)
- Stop every so often for review by me to make sure everything is working OK.
- Let me know of any changes you'd like to see made. --ThaddeusB (talk) 20:34, 18 December 2008 (UTC)
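To tie those processing steps together, here is a sketch of a change pass that reads page titles found by the link search, applies the literal substitution, and writes the proposed edits to a review file rather than saving anything to the wiki; the URLs, file names, and skip rules are placeholders:

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Sketch only: read page titles (one per line) on STDIN, fetch each page,
# apply a literal URL substitution, and queue the proposed edit for review.
# Nothing is written back to the wiki in this sketch.
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON;

my $ua  = LWP::UserAgent->new( agent => 'DeadLinkBOT-process/0.1 (sketch)' );
my $api = 'https://en.wikipedia.org/w/api.php';

my $dead_url = 'http://www.example.org/old/page.htm';    # placeholder
my $new_url  = 'http://www.example.org/new/page.htm';    # placeholder

sub fetch_page {    # returns (wikitext, timestamp of that revision)
    my ($title) = @_;
    my $uri = URI->new($api);
    $uri->query_form(
        action => 'query', prop => 'revisions',
        rvprop => 'content|timestamp', titles => $title, format => 'json',
    );
    my $data   = decode_json( $ua->get($uri)->decoded_content );
    my ($page) = values %{ $data->{query}{pages} };
    my $rev    = $page->{revisions}[0] or return;
    return ( $rev->{'*'}, $rev->{timestamp} );
}

open my $review, '>', 'proposed_changes.txt' or die "proposed_changes.txt: $!";
while ( my $title = <STDIN> ) {
    chomp $title;
    my ( $text, $basetime ) = fetch_page($title);
    next unless defined $text;
    next if $title =~ m{/Archive}i;            # never touch archives
    next if index( $text, $new_url ) >= 0;     # new URL already present: leave for manual review
    ( my $new_text = $text ) =~ s/\Q$dead_url\E/$new_url/g;
    next if $new_text eq $text;                # dead URL not actually in the wikitext
    print {$review} join( "\t", $title, $basetime ), "\n";
}
close $review;
</syntaxhighlight>

The timestamp recorded next to each title is what a later, conflict-aware save step can pass to the API as basetimestamp.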
- For the first month or so, human review is needed immediately before the "Make changes" step is committed. This can be done in batch mode like so:
- For each change, write the timestamp of the last version of the file, the old version of the file, and the updated version of the file to a holding area.
- After a suitable number of changes are in the holding file, manually review each change and mark it approved. A "suitable number" could be 1 change or an entire batch. Doing it 1 change at a time simulates assisted-editing tools like AutoWikiBrowser.
- For each approved change, verify there have been no intermediate edits, make the edit, and then move on to the next approved change. If possible, don't count intermediate edits that only affected other sections; i.e., make the change if at all possible, but abandon any change that looks like an edit conflict and log it as a failure so it can be done over again. (One way to structure the holding area and the review step is sketched below.)
- davidwr/(talk)/(contribs)/(e-mail) 21:04, 18 December 2008 (UTC)
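A sketch of the manual-review half of that workflow; the holding-area layout (an .old/.new file pair per pending change, plus an empty .approved marker) is an assumption, and Text::Diff is used purely to make each change easy to eyeball:

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Sketch only: walk the holding area, show each pending change as a unified
# diff, and mark the ones a human approves. File layout is an assumption.
use strict;
use warnings;
use Text::Diff;

for my $old_file ( glob 'holding/*.old' ) {
    ( my $new_file = $old_file ) =~ s/\.old$/.new/;
    print diff( $old_file, $new_file, { STYLE => 'Unified' } );
    print "Approve this change? [y/N] ";
    chomp( my $answer = <STDIN> );
    if ( lc $answer eq 'y' ) {
        ( my $flag = $old_file ) =~ s/\.old$/.approved/;
        open my $fh, '>', $flag or die "$flag: $!";
        close $fh;                        # empty marker file = approved
    }
}
</syntaxhighlight>

The change-bot would then act only on changes with an .approved marker, and would still re-check the page's timestamp before saving.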
- Thanks for the comments.
- I can certainly review the first X changes manually before uploading. However, the bot is capable of doing, for example, 1000 edits in under 3 hours (with standard rate limits applied); I certainly don't want to review 1000+ edits, let alone an entire month's worth. (I have already written and reviewed a number of edits locally, but can also review some more as we go.)
- I think you were actually talking about only having the edit-conflict feature for the manual-approval tests, which is definitely wise. However, once it goes live, it would be pointless. I could pull the history and check for intermediate updates, but this would most likely take longer than the text parsing (which happens in a tiny fraction of a second). I could, however, pull the history after an edit just to make sure there was no intermediate edit, and auto-revert if there was one. Let me know what you think. --ThaddeusB (talk) 21:26, 18 December 2008 (UTC)
Approved for trial (100 edits or 8 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. —Reedy 22:10, 18 December 2008 (UTC)
- I will begin trial edits after I add the logging features requested below. --ThaddeusB (talk) 22:22, 18 December 2008 (UTC)
- On the auto-revert, that's a great idea, but be sure to log whenever the auto-revert fails for any reason so you can do manual cleanup. As for "pointless": you may not be willing to review 1000 changes, but the person requesting the tool might want to review the changes before they are committed, or possibly shortly after. Given the restrictions on multi-user bots, a report listing both a "click here for diff" link and the actual diff in-line would be a handy thing to give to the requester: it's a lot easier to wade through a several-hundred-KB text file with page after page of diffs than it is to click on a few hundred links. Such a report would of course have a "click here for diff" link for each change, so the requester would have easy access to do a manual diff and, if necessary, a manual undo or cleanup. davidwr/(talk)/(contribs)/(e-mail) 22:12, 18 December 2008 (UTC)
- Sure, I will add a feature to log all changes to a file and upload them to the DeadLinkBOT user space after every 50 or so edits. I'll send you a link to your project's logs after I add the feature. (I'll put 50 of my trial edits into your project and 50 into the angeltowns.com request I initially wrote this for.) --ThaddeusB (talk) 22:22, 18 December 2008 (UTC)
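A sketch of what such a log page's wikitext could look like, one row per edit with a direct diff link; the fields kept per entry (page title, new revision id, timestamp) are assumed bookkeeping, not something specified in this request:

<syntaxhighlight lang="perl">
# Sketch only: build wikitext for a periodic log upload, one table row per
# edit with a "diff" link built from the new revision id.
use strict;
use warnings;

sub log_wikitext {
    my (@entries) = @_;    # each entry: { title => ..., revid => ..., when => ... }
    my $text = qq({| class="wikitable"\n! Page !! Time (UTC) !! Change\n);
    for my $e (@entries) {
        $text .= sprintf "|-\n| [[%s]] || %s || [%s diff]\n",
            $e->{title}, $e->{when},
            "https://en.wikipedia.org/w/index.php?diff=prev&oldid=$e->{revid}";
    }
    return $text . "|}\n";
}

# Example use:
# print log_wikitext( { title => 'Example', revid => 12345, when => '2008-12-18 22:30' } );
</syntaxhighlight>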
- Just a note: I'm going to play with AutoWikiBrowser to see if it's suitable for my project. I hear those tools can do 2000 or so edits an hour at full speed, which means it will take me less than 2 hours to go through the list. I'll leave 50 for you. Of course, I'll be slower than that until I become familiar with AWB, and I'll be rate-limited by the wiki software. It will be interesting to see which is faster per 50 edits: AWB or reviewing the edits after the fact. davidwr/(talk)/(contribs)/(e-mail) 22:26, 18 December 2008 (UTC)
Why not just use the appropriate options (basetimestamp and starttimestamp) to the API edit command to detect edit conflicts the normal way, instead of trying to do some odd "possibly overwrite others' edits, and then try to self-revert" scheme? Anomie⚔ 23:59, 18 December 2008 (UTC)
- Whoops! I've been using perlwikipedia 1.0, since it is the "featured download" on the Google Code site linked to from here. It didn't support edit-conflict detection (nor LinkSearch, which I wrote code for myself). Your comment didn't make much sense to me, so I went and looked, and it's actually on version 1.5 now! I'm assuming this new version detects edit conflicts... I guess I had better install that and read up on it instead of making some silly workaround. :) --ThaddeusB (talk) 04:54, 19 December 2008 (UTC)
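For reference, a sketch of the conflict-safe save Anomie is pointing at, done directly against the API rather than through perlwikipedia. It assumes the user agent already carries login cookies, and the token request uses the present-day meta=tokens route, so treat the details as illustrative rather than as the bot's actual code:

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Sketch only: save an edit with basetimestamp/starttimestamp so the API
# rejects the save with an "editconflict" error instead of overwriting an
# intermediate edit. Assumes $ua already holds login cookies.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use JSON;

my $api = 'https://en.wikipedia.org/w/api.php';
my $ua  = LWP::UserAgent->new(
    agent      => 'DeadLinkBOT-edit/0.1 (sketch)',
    cookie_jar => HTTP::Cookies->new,    # assumed to be filled by a prior login
);

sub save_edit {
    my ( $title, $new_text, $basetimestamp, $starttimestamp ) = @_;

    my $tok = decode_json( $ua->post( $api, {
        action => 'query', meta => 'tokens', format => 'json',
    } )->decoded_content )->{query}{tokens}{csrftoken};

    my $res = decode_json( $ua->post( $api, {
        action         => 'edit',
        title          => $title,
        text           => $new_text,
        summary        => 'Replacing dead link (see bot approval request)',
        basetimestamp  => $basetimestamp,    # timestamp of the revision that was fetched
        starttimestamp => $starttimestamp,   # when the bot began working on this page
        token          => $tok,
        format         => 'json',
    } )->decoded_content );

    if ( $res->{error} && $res->{error}{code} eq 'editconflict' ) {
        warn "$title: edit conflict, skipping for manual follow-up\n";
        return 0;
    }
    return $res->{edit} && $res->{edit}{result} eq 'Success';
}
</syntaxhighlight>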