Wikipedia:Bots/Requests for approval/DeadLinkBOT
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Automatic or Manually Assisted: (Mostly) Automatic, supervised
Programming Language(s): Perl
Function Summary: To correct dead links due to link rot
Edit period(s) (e.g. Continuous, daily, one time run): as needed
Already has a bot flag (Y/N): N
Function Details: DeadLinkBOT's purpose is to update links that are invalid due to link rot. The first version of the program will simply replace all instances of a user-supplied link with an updated link (i.e. a change pre-approved by me). When needed, the program is capable of making simple determinations about the nature of the WP page in order to pick a new link from a list of alternatives (given user-supplied rules). In the future, the program will be expanded to actively seek out updated links after retrieving a list of dead links to be updated. These more advanced changes will require user confirmation. When a page is edited to update a link, the bot will also apply AWB-like general fixes.
Discussion
- What about websites that go through regular downtime? If the bot reads them as dead while they are temporarily down, it will remove a good link. Xclamation point 05:01, 2 December 2008 (UTC)[reply]
- It will be attempting to fix 404 links found at Wikipedia:Linkrot. In theory, 404 errors are not due simply to downtime, but rather to a page being renamed or moved. Per WP policy, the bot won't remove any link for which it can't find an alternative; i.e., it will only address links that have moved to a new location. These precautions should prevent any removal of temporarily unavailable locations.
- The first version of the program will only change links specified in advance, starting with the 2200+ links here [1]. When I add the automatic updated link finding feature, the bot will double check proposed changes with me before making them --ThaddeusB (talk) 15:46, 2 December 2008 (UTC)[reply]
- Can you put in a double check of links, say, a week apart, to ensure that it's not a short-run 404 that caused the problem? -- Tawker (talk) 07:17, 6 December 2008 (UTC)[reply]
- Yes, I will add that feature. --ThaddeusB (talk) 22:00, 7 December 2008 (UTC)[reply]
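The week-apart double check requested above could look roughly like the following. This is a Python illustration only, not the bot's Perl code; the function name, the `(timestamp, status)` tuple shape, and the seven-day constant are assumptions made for the sketch.

```python
from datetime import datetime, timedelta

# "A week apart", per the suggestion above; the exact interval is an assumption.
RECHECK_INTERVAL = timedelta(days=7)


def is_confirmed_dead(first_check, second_check):
    """Treat a link as dead only if two checks, at least a week apart, both saw 404.

    Each check is a hypothetical (datetime, http_status) tuple recorded by
    whatever code actually fetched the URL.
    """
    (t1, status1), (t2, status2) = first_check, second_check
    far_enough_apart = (t2 - t1) >= RECHECK_INTERVAL
    return far_enough_apart and status1 == 404 and status2 == 404
```

A link that returned 404 once and 200 on the recheck would simply be dropped from the candidate list rather than rewritten.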
- Is the source code to your bot available? — Carl (CBM · talk) 14:15, 6 December 2008 (UTC)[reply]
- I wasn't planning on releasing it for public consumption. --ThaddeusB (talk) 22:00, 7 December 2008 (UTC)[reply]
- Why not? If it's actually going to be changing links in articles, I'd really like to know that the code is sound. Mr.Z-man 21:57, 12 December 2008 (UTC)[reply]
- Well considering it's explicitly not required, I shouldn't have to justify my decision. But since you asked, my code is undocumented and "ugly" - it is not intended to be read by anyone but me. I really don't see what the issue is - all the program does as far as Wikipedia goes is substitute a pre-screened dead URL for a pre-screened good one, possibly applying pre-screened regexes to pick between two or more different options. Nonetheless, I put the code up anyway: User:DeadLinkBOT/source --ThaddeusB (talk) 23:45, 12 December 2008 (UTC)[reply]
I have tested the bot with local writes and all works according to plan. I would appreciate it if a trial could be approved for actual wiki editing soon. Thanks. --ThaddeusB (talk) 02:34, 14 December 2008 (UTC)[reply]
- This all seems rather sketchy to me. "I'm gonna go through all the articles and change a bunch of links. And... um... apply 'AWB-like' general fixes too." People applying AWB-like changes usually get banned pretty quickly because they rarely consider the large, large number of corner cases. And general fixes require someone to watch them and verify each edit to avoid things getting screwed up. As for the link changes, do you have any examples from articles? Edits you've done by hand (or even using this script)? And will you only be dealing with pages in namespace 0? --MZMcBride (talk) 04:41, 15 December 2008 (UTC)[reply]
- First of all, I don't appreciate the attitude. I said nothing like "I'm going to go through all articles and change a bunch of links." What I actually said is that I was going to change all specific instances of a known bad link to a known good link (using Special:LinkSearch). I also said several times that every change would be pre-approved by me. If there was a problem doing general fixes, then why was that never mentioned before now? This request is now 2 weeks old and this is the first I'm hearing of it potentially being a problem. I am certainly willing to drop that part of the request and (potentially) resubmit it with a specific list of fixes as a separate request.
- I also stated the list of links I'd be starting with above. This is from a specific request from Wikipedia:AutoWikiBrowser/Tasks#LeighRayment.com_.28continued.29. There are 2500+ of them. I have tested the first batch of them with local writes, but it's against WP policy to have a bot edit WP without test approval, so of course I haven't actually written them to WP. Isn't that the whole point of having a test period?
- Since it's correcting DEAD links, I don't see any reason to limit its scope (although it does avoid editing archives), but I could easily change that if desired. --ThaddeusB (talk) 05:13, 15 December 2008 (UTC)[reply]
- So will this bot only be working on angeltowns.com links or is this request for broader approval? If it's for the former, this can probably be speedily approved. For the latter, it's going to require more time / testing / whatever. As to why anyone didn't mention that AWB's general fixes are problematic, well probably because most of BAG is either inactive or incompetent. /me shrugs. Though I do think AWB's documentation is pretty explicit about the 'danger' of general fixes. --MZMcBride (talk) 06:28, 15 December 2008 (UTC)[reply]
- I wrote the bot in order to correct the angeltowns links, but I don't see any reason to limit its scope. I have written dozens of text-parsing scripts in the past and am well aware of the potential issues involved with unexpected input and such. I do realize AWB-style general fixes are difficult to implement correctly, but I feel I am up to the challenge. Nonetheless, I will drop that part of the request at this time. (I have seen bots approved for gen fixes in the past and didn't think it would be an issue or I would never have added that part.) HTML links, however, do not present such issues. There just is not any realistic chance of a search for "http://www.somesite.com/directory/somedeadURL.htm" generating false positives outside of a few specific pages such as WP's list of dead links. If my bot works correctly on somewebsite.com, it will work on someotherwebsite.com assuming the input (supplied by me) is valid.
- I am well aware that ultimately I am responsible for every edit the bot makes, and will utilize the utmost care in what I tell it to fix. If I maliciously tell it to change every www.microsoft.com to www.myspamsite.com then obviously I'd be in trouble. But if that was my intention why would I even bother trying to get approval? --ThaddeusB (talk) 12:37, 15 December 2008 (UTC)[reply]
- I think this is wonderful. If it fixes angeltowns.com/town that is a great test in itself. Let's go guys. Kittybrewster ☎ 09:56, 15 December 2008 (UTC)[reply]
Can a member of BAG please explain exactly what they want me to do to prove this bot works correctly? I've tested it locally, answered every question here, released the source, and tried to be patient but no one seems to be willing to act. What do I need to do to get the ball rolling? --ThaddeusB (talk) 02:34, 18 December 2008 (UTC)[reply]
I would like to see the bot split into two parts: a read-only bot that identifies items that need replacing, and a change-bot that works off the generated list, with cautions that the change-bot would only make changes if the text to be changed and its immediate surrounding text hadn't been edited in the meantime. There are at least three good reasons for this:
- It greatly reduces the risk of harm.
- It can be used for "identification" situations to quickly identify all occurrences of a particular URL or URL fragment for other uses, such as looking for patterns of spamming, etc. I'm not familiar with semi-automated editing tools, but in principle the generated list can be used as input to a semi-automated tool, leaving it to a human being to confirm or cancel each edit. This would be practical on only relatively short lists, maybe a few hundred or so.
- The read-only portion by definition does not need approval of the BAG, it can run as soon as it's written.
Once the two bots are working nicely separately, they can be interleaved, so as an item is added to the list, it is immediately processed and the edit is made. davidwr/(talk)/(contribs)/(e-mail) 19:50, 18 December 2008 (UTC)[reply]
There is already a bot request for which this bot would be useful: Wikipedia:Bot_requests#Bulk-replace_URL_for_Handbook_of_Texas_Online davidwr/(talk)/(contribs)/(e-mail) 19:50, 18 December 2008 (UTC)[reply]
- Hello, the way the bot is currently structured is as follows:
Find Links (this part is not yet written, but like you say doesn't actually require approval since it doesn't do any wiki writing)
- Gets a dead URL from a Wikipedia:Link_rot sub page
- Finds the last good version of said URL using archive.org or search engine cache
- Make sure it wasn't an ad page set up by a domain squatter; if so, find an older version
- See if the last version mentions a site move, if so check the move URL to make sure it is good & that the content matches
- If no move URL is found, perform a search engine search using block portions of the last good page to try and find where it moved
- Write recommended changes to file for review
Wait a week to ensure the URL is indeed dead
Alternatively, if a user (such as yourself) supplies a URL that needs changing, the URL can go directly into the "for review" stack
URL reviewed by me to make sure the recommended change is accurate, then it's moved into a machine-readable "to be processed" file
Processing
- Get URL + change(s) from file; a change can require a simple test, such as making sure "text" is in the page to be changed, and make decisions based on those tests; the new text can be anything - presumably a URL or a template.
- Use Special:LinkSearch to find all instances of the URL on Wikipedia - this list could be output to a file if you want
- Make changes using Perl's s/ command; scope can be limited, if desired (e.g. to article space only; archives are always excluded). Alternatively, I could add a simple check to make sure both the old & new URL don't appear on the same page - that should remove any false positives - and write those cases to a file for manual review.
- Stop every so often for review by me to make sure everything is working OK.
- Let me know of any changes you'd like to see made. --ThaddeusB (talk) 20:34, 18 December 2008 (UTC)[reply]
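The "Processing" step described above, including the optional check that the old and new URL do not both appear on the same page, can be sketched as follows. This is a Python illustration rather than the bot's Perl; `plan_replacement` and its return shape are hypothetical names chosen for the example, and it uses a plain substring replace where the bot uses Perl's s/ operator.

```python
def plan_replacement(page_text, old_url, new_url):
    """Return (new_text, needs_review) for one page.

    - If the old URL is absent, nothing changes.
    - If both URLs are already present, leave the page alone and flag it
      for manual review, as suggested in the comment above.
    - Otherwise replace every occurrence of the old URL.
    """
    if old_url not in page_text:
        return page_text, False
    if new_url in page_text:
        # Possible false positive: write this case to the review file instead.
        return page_text, True
    return page_text.replace(old_url, new_url), False
```

Pages flagged with `needs_review` would go to the manual-review file instead of being saved.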
- For the first month or so, human review is needed immediately before the "Make changes" is committed. This can be done in a batch mode like so:
- For each change, write the timestamp of the last version of the file, the old version of the file, and the updated version of the file to a holding area.
- After a suitable number of changes are in the holding file, manually review each change and mark it approved. A "suitable number" could be 1 change or an entire batch. Doing it 1 change at a time simulates assisted-editing tools like AutoWikiBrowser.
- For each approved change, verify there have been no intermediate edits and make the edit then move on to the next approved change. If possible, don't count intermediate edits that only affected other sections, i.e. make the change if at all possible, but abandon any change that looks like an edit-conflict and log it as a failure so it can be done over again.
- davidwr/(talk)/(contribs)/(e-mail) 21:04, 18 December 2008 (UTC)[reply]
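The batch-review workflow outlined above (hold each proposed change with the timestamp of the version it was computed from, approve by hand, then commit only if nothing intervened) might be modeled like this. It is a sketch under assumptions: `PendingChange`, `apply_approved`, and the two callables passed in are hypothetical, and the sketch treats any intermediate edit as a conflict rather than trying to ignore edits to other sections.

```python
from dataclasses import dataclass


@dataclass
class PendingChange:
    title: str
    base_timestamp: str  # timestamp of the revision the change was computed from
    old_text: str
    new_text: str
    approved: bool = False  # set by the human reviewer


def apply_approved(changes, current_timestamp_of, save):
    """Commit approved changes whose page has not been edited since review.

    current_timestamp_of(title) and save(title, text) are caller-supplied
    callables (hypothetical); conflicts are logged as failures for a redo.
    """
    applied, failed = [], []
    for change in changes:
        if not change.approved:
            continue
        if current_timestamp_of(change.title) != change.base_timestamp:
            failed.append(change.title)  # intermediate edit: abandon and log
            continue
        save(change.title, change.new_text)
        applied.append(change.title)
    return applied, failed
```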
- Thanks for the comments.
- I can certainly review the first X changes manually before uploading. However, the bot is capable of doing, for example, 1000 edits in under 3 hours (with standard rate limits applied); I certainly don't want to review 1000+ edits, let alone an entire month's worth. (I have already written and reviewed a # locally, but can also intermediately review some more.)
- I think you were actually talking about only having the edit conflict feature for the manual approval tests, which is definitely wise. However, once it goes live, it would be pointless. I could pull the history and check for intermediate updates, but this would most likely actually take longer than the text parsing (which happens in a tiny fraction of a second.) I could, however, pull the history after an edit to just make sure there was no intermediate edit and auto-revert if there was any. LMK what you think. --ThaddeusB (talk) 21:26, 18 December 2008 (UTC)[reply]
Trial
Approved for trial (100 edits or 8 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. —Reedy 22:10, 18 December 2008 (UTC)[reply]
- I will begin trial edits after I add the logging features requested below. --ThaddeusB (talk) 22:22, 18 December 2008 (UTC)[reply]
- On the auto-revert, that's a great idea, but be sure to log if the auto-revert failed for any reason so you could do manual cleanup. As for pointless, you may not be willing to review 1000 changes, but the person requesting the tool might want to review the changes before they were committed or possibly shortly after. Given restrictions on multi-user bots, a report listing both a "click here for diffs" plus the actual diffs in-line would be a handy thing to give to the requester: It's a lot easier to wade through a several-hundred-KB text file with page after page of diffs than it is to click on a few hundred links. Such a report would of course have a "click here for diff" link for each change, so the requester would have easy-access to do a manual diff and if necessary, manual undo or cleanup. davidwr/(talk)/(contribs)/(e-mail) 22:12, 18 December 2008 (UTC)[reply]
- Sure, I will add a feature to log all changes to a file and upload them to the DeadLinkBot user space after every 50 or so edits. I'll send you a link for your project's logs after I add the feature. (I'll put 50 of my trial edits into your project and 50 into the angeltowns.com request I initially wrote this for.) --ThaddeusB (talk) 22:22, 18 December 2008 (UTC)[reply]
- Just a note: I'm going to play with AutoWikiBrowser to see if it's suitable for my project. I hear those tools can do 2000 or so edits an hour at full speed, which means it will take me less than 2 hours to go through the list. I'll leave 50 for you. Of course, I'll be slower than that until I become familiar with AWB, and I'll be rate-limited by the wiki software. It will be interesting to see which is faster per 50 edits: AWB or reviewing the edits after the fact. davidwr/(talk)/(contribs)/(e-mail) 22:26, 18 December 2008 (UTC)[reply]
Why not just use the appropriate options (basetimestamp and starttimestamp) to the API edit command to detect edit conflicts the normal way, instead of trying to do some odd "possibly overwrite others' edits, and then try to self-revert" scheme? Anomie⚔ 23:59, 18 December 2008 (UTC)[reply]
- Whoops! I've been using perlwikipedia 1.0 since it is the "featured download" on Google code site linked to from here. It didn't support edit conflict detection (nor linksearch which I wrote code for myself). Your comment didn't make much sense to me, so I went and looked and it's actually on version 1.5 now! I'm assuming this new version detects edit conflicts... I guess I better install that and read up on it instead of making some silly workaround. :) --ThaddeusB (talk) 04:54, 19 December 2008 (UTC)[reply]
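For readers unfamiliar with the two options Anomie mentions: `basetimestamp` is the timestamp of the revision the edit was based on, and `starttimestamp` is the time the editing process began; the MediaWiki API uses them to detect edit conflicts and recreated pages. The sketch below only assembles the parameter set for an `action=edit` request and makes no network call. It is written in Python for illustration; the function name is hypothetical, and how the token and timestamps are obtained is left out.

```python
def build_edit_params(title, new_text, summary, base_timestamp, start_timestamp, token):
    """Assemble parameters for a MediaWiki action=edit request.

    base_timestamp:  timestamp of the revision that was fetched and modified.
    start_timestamp: time at which the bot began working on the page.
    With both supplied, the API can reject the save on an edit conflict
    instead of silently overwriting someone else's change.
    """
    return {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": new_text,
        "summary": summary,
        "basetimestamp": base_timestamp,
        "starttimestamp": start_timestamp,
        "token": token,
    }
```

With this in place, the "overwrite, then self-revert" workaround discussed earlier becomes unnecessary, since a conflicting save is refused up front.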
Trial complete
Trial complete. I rewrote the program to query the API directly (rather than using perlwikipedia.pm or an equivalent). This enabled more efficient resource usage and the ability to correctly detect edit conflicts. However, it did lead to some temporary bugs. Most embarrassingly, the bot's first 5 edits blanked pages due to a variable being mistyped. (Doh!) Of course, I promptly fixed any errors the bot made and corrected the code to avoid repeating them. :)
The bot can now detect edit conflicts and false positives (e.g. on talk pages), although neither arose in the trial period. It ignores Wikipedia: space articles (excluding WikiProject pages), archives, sandboxes, and pages in its own userspace.
After everything was working, DeadLinkBOT made just under 50 edits correcting angeltowns.com links. A log of these edits can be found at User:DeadLinkBOT/Logs/AngelTowns.log. I have manually reviewed them and have also invited Kittybrewster to review and comment here.
Here is a representative sample of the kinds of corrections it can routinely make:
- Boring 1:1 URL replacement [2]
- URL replacement based on regex [3]
- multiple related URLs corrected on same page [4]
- URL -> simple "permanent link" template (template chosen from small list based on article's title & contents) [5]
- URL -> two simple templates based on the subject being both a baron and a member of parliament [6]
Collectively, these edits represent both the typical workload of the bot (straight URL replacement) and the most complicated case that will regularly arise (transition to simple template). I am confident that the bot will be 99%+ accurate with these edits.
During the trial, I used the other 50 approved edits to parse a much more difficult situation than the bot would typically face - transitioning a dead URL to a complicated template (Handbook of Texas). This change uses a custom function that no other changes will use, so its accuracy is independent of the normal functional accuracy. Since the parsing is fairly complex, I have had to make several changes to it, so the edit history (User:DeadLinkBOT/Logs/HandbookOfTexas.log) is not completely representative of the current functionality. In particular, the bot made several errors that it would no longer make. (All changes were manually verified and corrected when needed.) The bot should be much closer to fully accurate now, but all changes will be manually verified for the foreseeable future.
Here is a representative sample of the kinds of corrections it can make:
- <ref> transitioned to {{Handbook of Texas}} template with missing information filled in by retrieving the handbook page [7]
- named reference transitioned with name left intact [8]
- external link transitioned to simpler version of HT template (no author/dates) [9]
- bare link transitioned to template with <ref> tags added [10]
- bare link with title updated (b/c it's not a reference, but rather part of the text); also malformed template corrected [11]
- bare link on talk page simply updated (not appropriate to transition to template) [12]
Again, these edits are not typical but rather representative of the most complicated edits the bot would ever do. If the need for this sort of change ever arises again, I would of course be manually verifying everything again. I have explained my methodology to davidwr and invited him to comment here. --ThaddeusB (talk) 07:14, 24 December 2008 (UTC)[reply]
- It works incredibly well. Congratulations and thank you very much. Kittybrewster ☎ 08:52, 24 December 2008 (UTC)[reply]
- Something to consider for the future: add a "test" switch to the bot where it will save the proposed edits to its local hard drive instead of actually editing Wikipedia. You could then use diff, wdiff, and the like to make sure the edit is correct before running the bot for real. Anomie⚔ 14:25, 24 December 2008 (UTC)[reply]
- I did write some edits to file, but I didn't think to use diff. I manually compared the before and after, which made it easy for me to miss errors. I certainly will use diff in the future - that will help a lot. --ThaddeusB (talk) 15:57, 24 December 2008 (UTC)[reply]
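The "test switch plus diff" idea above can be approximated without touching the wiki at all. The following Python sketch (not the bot's Perl) uses the standard `difflib` module to produce a unified diff of the current and proposed page text; `dry_run_diff` is a hypothetical name.

```python
import difflib


def dry_run_diff(title, old_text, new_text):
    """Return a unified diff of the proposed edit instead of saving it."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"{title} (current)",
        tofile=f"{title} (proposed)",
    )
    return "".join(diff)
```

An empty string means the edit would change nothing; anything beyond the expected one-line URL swap is a sign the replacement rule needs another look before a live run.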
- Overall good work. I confess I haven't had time to go back behind you and audit everything, I'll take your word for it that you did a good job auditing the results. However, I did find a couple of issues:
- There is a problem with non-ASCII character encoding: pages with unusual characters do not log properly. The change to Texas–Indian Wars logged as TexasâIndian Wars. The change to Alonso Álvarez de Pineda logged as Alonso lvarez de Pineda. While this particular error is of no great consequence, please check the code for similar errors that may be more consequential.
- Yah, it's a problem specific to the log. It should be easy to clear up but I didn't bother since the diff links work right. I'll go ahead and fix it now. --ThaddeusB (talk) 15:57, 24 December 2008 (UTC)[reply]
- Some people don't like user page material modified. Consider immediately self-reverting any change made in User: space and putting a note on the user's talk page pointing to the changed diff, and leave it up to them whether or not to commit the change. Alternatively, don't self-revert but do drop the user a note. I know most bots treat user pages the same as article space, but it's a trend I'd like to see change. davidwr/(talk)/(contribs)/(e-mail) 14:49, 24 December 2008 (UTC)[reply]
- I agree with the first point. But not, in the case of dead links, with the second. Kittybrewster ☎ 15:24, 24 December 2008 (UTC)[reply]
- I'll add a feature to drop the user a courtesy note on their talk page. --ThaddeusB (talk) 15:57, 24 December 2008 (UTC)[reply]
- Your decision. I don't like the idea of leaving dead links lying around clogging up the internet and wikipedia. Maybe the answer is to have the bot change it and add a note saying this has been done. Kittybrewster ☎ 16:46, 24 December 2008 (UTC)[reply]
- Yah, that's what I meant. The bot will still change the link but will leave a courtesy note informing the user of the change. --ThaddeusB (talk) 19:35, 24 December 2008 (UTC)[reply]
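On the log-encoding problem davidwr reported earlier in this thread: one common way to get output like "TexasâIndian Wars" is for UTF-8 bytes to be read back as Latin-1. Whether that is what happened in the bot's log is not stated here, so the Python sketch below is only an illustration of that failure mode and of writing log lines with an explicit encoding; both function names are hypothetical.

```python
def misdecode_as_latin1(title):
    """Reproduce the classic mojibake: UTF-8 bytes interpreted as Latin-1."""
    return title.encode("utf-8").decode("latin-1")


def format_log_line(title, diff_url):
    """Build one log line as UTF-8 bytes so non-ASCII titles survive intact."""
    return f"{title}\t{diff_url}\n".encode("utf-8")
```

For example, the en dash in "Texas–Indian Wars" becomes "â" followed by two invisible control characters when misdecoded, which would display much like the logged title quoted above.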
Statements like "99%+ accurate" are meaningless, since there'll always be a user who does the unexpected and sets examples for others to follow. So anyway I'm author and maintainer of the Checklinks tool and PDFbot. Checklinks detects, lists, and allows user repairs of dead links on pages, it is mostly used as a link checker on article review. PDFbot had been approved for similar dead link repair; however, it actually checks every link it replaces to make sure it works.
So here are some of the caveats:
- replacement of example.com -> example.org: no replacement should happen in http://web.archive.org/web/*/http://example.com
- bracketed links are matched differently from free links, and free links match differently depending on whether there's a "(" in them. Are you going to replace free links?
- http://www2.jsonline.com/story/index.aspx?id=279432 will on occasion return the status code 404
- Many sites use soft 404s, and matching these is hard. Some can only be matched by content analysis; see Wikipedia:Bots/Requests for approval/DumZiBoT for some details
- Does the bot remove {{dead link}} when replacing the dead links?
That is all I can think of at the moment. If this bot is approved, can it simplify nytimes links from [13] to [14] to remove the login requirements? — Dispenser 18:47, 31 December 2008 (UTC)[reply]
- First of all, thank you for your insight. I used the phrase 99% accurate because I couldn't think of any way it would fail, but there is always a possibility of something wacky happening. With your insights, I was able to eliminate some unlikely, but possible situations...
- As currently programmed, the bot will replace something like http://whatever.com/page.htm with http://newsite.com/page.htm (no brackets -> no brackets) when in "standard URL replacement" mode. This way something goofy like [http://oldsite.com http://oldsite.com] doesn't change into [http://newsite.com http://oldsite.com] (it normally leaves titles unchanged) but rather [http://newsite.com http://newsite.com].
- In "URL -> template" mode, it will usually replace these types of bare URLs with the desired template, but it can also leave them untouched or just change them to new bare URLs, depending on how the rule for the change is set up. (I believe Special:Linksearch does pick up these kinds of "bare" URLs; if it doesn't, my bot won't find the page though).
- I have now added a clause that these bare URLs must be preceded by a punctuation mark (! ? . , ' "), space, }, |, or >. Thus if it is part of a larger URL it won't be picked up. Although it was unlikely that a page would have an archive.org link and one to the old URL directly, it doesn't hurt to fix this problem and anything similar by being explicit. :)
- All old_link->new_link rules are manually reviewed before being sent to the bot for processing, so I'll just ignore anything from jsonline.com.
- There are plenty of normal 404s to work through, so I'll be ignoring "soft" ones for the time being.
- I was previously unaware of the {{dead link}} template, but now that I am aware of it, I've added code to remove the template (or its aliases) from a corrected link (as long as the link & template are separated only by whitespace).
- Changing NYT links is, technically, quite simple. However, I am not quite sure what you wanted exactly:
- http://www10.nytimes.com/(date)/(scope)/article.htm?(bunch of junk) -> http://www.nytimes.com/(date)/(scope)/article.htm
- http://www(##).nytimes.com/(date)/(scope)/article.htm?(bunch of junk) -> http://www.nytimes.com/(date)/(scope)/article.htm [all numbers]
- http://www*.nytimes.com/(date)/(scope)/article.htm?(bunch of junk) -> http://www.nytimes.com/(date)/(scope)/article.htm [include www.]
- http://*nytimes.com/(date)/(scope)/article.htm?(bunch of junk) -> http://www.nytimes.com/(date)/(scope)/article.htm [include links without subdomain]
- http://www(##).nytimes.com/(date)/(scope)/article.htm* -> http://www.nytimes.com/(date)/(scope)/article.htm [include articles without the parameters]
- Basically, I need to know what causes the login screen to be generated.
- Any further questions, just ask. :) --ThaddeusB (talk) 23:03, 31 December 2008 (UTC)[reply]
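The "must be preceded by punctuation, space, }, | or >" clause described in this reply maps naturally onto a fixed-width lookbehind. The sketch below is a Python illustration of that idea, not the bot's actual Perl; `replace_bare_url` is a hypothetical name, and the character class is taken from the list given in the reply.

```python
import re

# One character of lookbehind: whitespace or one of ! ? . , ' " } | >
PRECEDING = r"""(?<=[\s!?.,'"}|>])"""


def replace_bare_url(text, old_url, new_url):
    """Replace old_url only where it is not embedded in a larger URL."""
    pattern = re.compile(PRECEDING + re.escape(old_url))
    # A function replacement avoids backslash processing in new_url.
    return pattern.sub(lambda match: new_url, text)
```

Because the old URL inside an archive.org address is preceded by "/", it is left alone. One consequence of requiring a preceding character is that a URL at the very start of the text is not matched by this pattern.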
- You should update the source code. Also, you can probably make good use of Checklinks source code.
- The best way to ensure that the url isn't part of another is to use the look behind
(?<!\w://[^][<>\s"]*)
- By manual review, do you mean that you look at each replacement, or just look to make sure it makes sense for most of them? I would prefer automated review in addition to any manual.
- I think the login is caused by the oref=slogin but would like the URL to be simple; see http://no-www.org/ and URL normalization.
- Does the bot ignore nowiki, comment, includeonly, source tags?
- — Dispenser 22:17, 7 January 2009 (UTC)[reply]
- Yes, I use lookbehind; sorry if that wasn't clear. The snippet you provided won't actually work in Perl since it doesn't have variable-length lookbehind, but my positive lookbehind
(?<=[\s!?.,'"}|>*])
should be functionally equivalent.
- I mean that I make sure the change is valid in general (for domain moves). Obviously if the original link was mistyped or something the new one will still be wrong, but it will be less wrong. Why leave a link to a dead domain unchanged just because it was mistyped? There is no (reasonable) way for a bot to fix such links, but at least it will be easier for a human to fix if they at least know where to look. In the rare case when a page simply moved locations on the same domain, only the exact page match would be changed.
- LinkSearch will only find actual links, not comments and such, so no they wouldn't be corrected.
- I'll put the NYT links on my to-do list for when the bot is approved.
- --ThaddeusB (talk) 22:40, 7 January 2009 (UTC)[reply]
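As an illustration of the NYT request, here is one possible normalization in Python. It assumes the broadest reading of the variants listed earlier for numbered hosts (any `www`, `www10`, and so on collapses to `www.nytimes.com`) and drops the whole query string, including `oref=slogin`; which variant was actually wanted was left open in the discussion, so treat the function name and the rule itself as assumptions.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches www.nytimes.com, www10.nytimes.com, etc. (assumed scope)
NYT_HOST = re.compile(r"^www\d*\.nytimes\.com$", re.IGNORECASE)


def normalize_nyt_url(url):
    """Collapse numbered NYT hosts to www.nytimes.com and drop the query string."""
    parts = urlsplit(url)
    if not NYT_HOST.match(parts.netloc):
        return url  # not an NYT link covered by this rule; leave untouched
    return urlunsplit((parts.scheme, "www.nytimes.com", parts.path, "", ""))
```

Links without a `www` subdomain are deliberately left untouched here, since that case was one of the open questions.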
Approval?
Any chance of getting this approved soon? The trial ended almost 2 weeks ago and I have addressed all the concerns raised. I'd like to get started on fixing up more dead links soon. Thanks. --ThaddeusB (talk) 23:45, 4 January 2009 (UTC)[reply]
- I'm somewhat against the concept. I understand its need, but we really should not be expending human effort in cleaning up after other people's intentional messes. The method that we should be employing is to email the webmaster (possibly with a list of broken URLs) and point them to some guide on how to setup the server with proper redirection and some guidance on a good URL scheme (like DOI and such). Broken URLs affect everyone not just us. — Dispenser 21:23, 7 January 2009 (UTC)[reply]
- It is Wikipedia policy to repair dead links already. All my bot does is reduce the amount of human effort needed to do so. I'm not creating policy, just trying to automate some tedious work.
- Websites change their location for a variety of reasons. Not everyone wants to continue to pay for an old domain/hosting just for redirect purposes. I agree that websites should use redirects instead of just "disappearing" but the fact is that they don't always. I can't change the world, only make do with the way it is. Besides, even websites that do have redirects from their old locations usually don't keep the old links valid indefinitely and almost always ask visitors to update the link that brought them there. --ThaddeusB (talk) 22:20, 7 January 2009 (UTC)[reply]
{{BAGAssistanceNeeded}} --Tinucherian 12:04, 8 January 2009 (UTC)[reply]
I've got a few more user submitted link updates to work on now. Now >5000 links waiting to be updated. Hoping to get started soon, ThaddeusB (talk) 13:26, 8 January 2009 (UTC)[reply]
- Approved. --Chris 23:52, 8 January 2009 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.
- Wait, wait, wait. Sorry that I don't have the time (or writing inclination) to commit to WP 24/7, but ThaddeusB's response was flawed. Wikipedia:Dead external links isn't a policy; it's just an unmaintained project. Arbitrarily moving dead resources without checking is a bad idea, since it eliminates Wayback history data and will introduce problems. — Dispenser 07:15, 9 January 2009 (UTC)[reply]
- Strongly disagree. This is human overseen, is much needed and fully complies with wikipolicy. It is not arbitrary and leaves a trace in the history. Kittybrewster ☎ 09:17, 9 January 2009 (UTC)[reply]
- The only part that is overseen is the approval of link changes; this is no substitute when the replacement algorithm is flawed. He hasn't shown that it is harmless. It modified comments, nowiki tags, skips citations that embed links in < >.
- The approval was granted 26.5 hours after my last post; for reasons of edit warring, I do not immediately respond to comments. I was out yesterday doing some work, so I couldn't respond till the late evening. So it is wikipolicy that a bot can make harmful edits, which are unsearchable in wikiblame? Why was my request to review the source code not taken into account? So if I dumped a list of bad pages in 3-4 months, would Kittybrewster go through the history and find who edited them? — Dispenser 15:34, 9 January 2009 (UTC)[reply]
- The above comment is factually inaccurate and unnecessarily rude. The bot has *not* "modified comments, nowiki tags, [or skipped] citations that embed links in < >" and even if it did it wouldn't be harmful; the link is out-of-date whether it is clickable or not. Every change the bot makes is logged with easily clickable diffs at User:DeadLinkBOT/Logs and every change is being manually reviewed by me to iron out any bugs. (I've stated this particular fact several times now.) Please DO raise any actual errors the bot makes either on its talk page or mine, but this endless speculation of how it's going to harm Wikipedia is getting old.
- This request had been open for 2 months and Dispenser is the only person to object. No one user gets veto rights and the task is clearly desirable, despite Dispenser's insistence that it isn't Wikipedia policy to fix dead links. --ThaddeusB (talk) 16:34, 9 January 2009 (UTC)[reply]
- I want the bot, but I want it to work right. My frustration is in BAG's strange and/or sudden approval in bot processes, often without much warning. The "speculation" was based on the last release of the source code. — Dispenser 17:25, 9 January 2009 (UTC)[reply]