Operator: Cyberpower678 ( talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 13:37, Saturday, June 6, 2015 ( UTC)
Automatic, Supervised, or Manual: Automatic and Supervised
Programming language(s): PHP
Source code available: Here
Function overview: Replace existing links tagged as dead with a viable copy of an archived page.
Links to relevant discussions (where appropriate): Here
Edit period(s): Daily, though it will likely run continuously.
Estimated number of pages affected: 130,000 to possibly a million.
Exclusion compliant (Yes/No): Yes
Already has a bot flag (Yes/No): Yes
Function details: The bot will crawl its way through articles on Wikipedia and attempt to retrieve an archived copy of each dead link, at the time closest to the original access date if one is specified. To avoid persistent edit-warring, users have the option of placing a blank, non-breaking {{cbignore}} tag on the affected reference to tell Cyberbot to leave it alone. If the bot makes any changes to the page, a talk page notice is placed alerting the editors there that Cyberbot has tinkered with a ref.
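As a rough illustration of the snapshot lookup described above, the public Wayback Machine availability API can be queried with the original URL and a timestamp derived from the access date. The following is a minimal sketch only; the function name, user agent and fallback behaviour are assumptions, not Cyberbot's actual code:

```php
<?php
// Minimal sketch (not Cyberbot's actual code): look up the snapshot closest
// to a citation's access date via the public Wayback availability API.
function getClosestSnapshot( $url, $accessDate = null ) {
    $query = 'https://archive.org/wayback/available?url=' . urlencode( $url );
    if ( $accessDate !== null ) {
        // The API expects timestamps in YYYYMMDDhhmmss format.
        $query .= '&timestamp=' . date( 'YmdHis', strtotime( $accessDate ) );
    }
    $ch = curl_init( $query );
    curl_setopt_array( $ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => 'ExampleDeadLinkBot/0.1',
        CURLOPT_TIMEOUT        => 30,
    ] );
    $response = curl_exec( $ch );
    curl_close( $ch );
    if ( $response === false ) {
        return null;
    }
    $data = json_decode( $response, true );
    // The API reports the closest available snapshot, if any.
    return $data['archived_snapshots']['closest']['url'] ?? null;
}
```

If no snapshot comes back, the bot would fall back to tagging the citation as dead rather than substituting an archive.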
The bot's detection of a dead link needs to be carefully thought out to avoid false positives such as temporary site outages. Feel free to suggest algorithms to add to this detection function. At the moment the plan is to check for a 200 OK response in the header. If the response indicates the site is down, the bot proceeds to add the archived link if available, or otherwise tags the link as dead. A rule mechanism can be added to the configuration for sites that follow certain rules when they kill a link.
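For concreteness, a status-code probe along the lines described above might look like the sketch below. This is illustrative only, not the bot's actual detection code, and as the discussion further down makes clear, a single header check is nowhere near sufficient on its own:

```php
<?php
// Illustrative only: probe a URL and report whether it looks alive.
// Real-world detection needs far more nuance (see the discussion below).
function looksAlive( $url ) {
    $ch = curl_init( $url );
    curl_setopt_array( $ch, [
        CURLOPT_NOBODY         => true,  // HEAD-style request; some sites need a full GET
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => 'ExampleDeadLinkBot/0.1',
        CURLOPT_TIMEOUT        => 30,
    ] );
    curl_exec( $ch );
    $code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
    curl_close( $ch );
    // Treat 2xx as alive; anything else (404, 410, 5xx, or 0 on connection
    // failure) is a candidate for archiving or tagging as dead.
    return $code >= 200 && $code < 300;
}
```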
There is a configuration page that allows the bot to be configured to the desired specifications, which can be seen at User:Cyberbot II/Dead-links. The bot attempts to parse the various ways references have been formatted and tries to keep the formatting consistent so as not to destroy the citation. Even though the option to not touch an archived source is available, Cyberbot II will attempt to repair misformatted sources using archives if it comes across any.
For any link/source that is still alive, Cyberbot can check for an available archived copy, and request that the site be archived if it can't find one.
The bot can forcibly verify whether the link is actually dead, or be set to blindly trust references tagged as dead.
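Purely as an illustration of the kind of switches described above, a run-page configuration could be modelled as in the sketch below. The option names here are hypothetical; the authoritative list of settings is the on-wiki configuration page at User:Cyberbot II/Dead-links:

```php
<?php
// Hypothetical configuration shape; the real option names live on
// User:Cyberbot II/Dead-links and may differ.
$config = [
    'verify_dead'        => true,   // re-check links tagged as dead instead of trusting the tag
    'touch_archives'     => false,  // leave already-archived sources alone
    'notify_talk_page'   => true,   // post a talk page notice after editing a ref
    'archive_live_links' => true,   // request archiving of live links with no snapshot yet
];
```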
The bot may need some further development depending on what additional issues crop up, but is otherwise ready to be tested.
I think this is a great idea. One thought: there are several kinds of dead links - (a) sometimes the site is completely defunct and the domain simply doesn't work - there is no server there any more, (b) sometimes the site has been bought by another entity and whatever used to be there isn't there any more, so most things get a 404, (c) sometimes a news story is removed and now gets a 404, or (d) sometimes a news story is removed and is now a 30x redirect to another page.
For (a), (b), or (c), what you are describing is a great idea and probably completely solves the problem. For (d), it may be tricky to resolve whether this is really a dead link or whether they merely relocated the article.
One thought/idea: can you have a maintainable list of newspapers that are known to only leave their articles available online for a certain amount of time? The Roanoke Times, for example, I think only leaves things up for maybe six months. Sometimes they might redirect you to a list of other articles by the same person, e.g. [1], which was a specific article by Andy Bitter and now takes you to a list of Andy Bitter's latest articles. Other times you just get a 404, e.g. [2]. Since it is completely predictable that links from roanoke.com will disappear after six months, you could automatically replace 302s, whereas for some other sites you might tag the link for review instead of making the replacement on a 302. An additional possible enhancement would be that, knowing that the article is going to disappear in six months, you could even submit it to one of the web citation places so that we can know it will be archived, even if archive.org misses a particular article. -- B ( talk) 18:38, 9 June 2015 (UTC) reply
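The per-site rule list suggested above could be kept as simple data, roughly along the lines of this sketch. The domain names, lifetimes and handling choices are illustrative examples drawn from the comment, not verified facts about those sites:

```php
<?php
// Sketch of a maintainable per-domain rule list, as suggested above.
// Values are illustrative; actual lifetimes would need to be researched.
$domainRules = [
    'roanoke.com' => [
        'article_lifetime_days' => 180,       // links predictably die after ~6 months
        'on_redirect'           => 'replace', // a 30x here almost always means the story is gone
        'preemptive_archive'    => true,      // submit new links to an archiving service right away
    ],
    'example-news-site.com' => [
        'article_lifetime_days' => null,      // unknown lifetime
        'on_redirect'           => 'review',  // tag for human review instead of auto-replacing
        'preemptive_archive'    => false,
    ],
];
```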
Since you mentioned this, how do you avoid things like "temporary site outage"? Links go temporarily bad very often. Sometimes there are DNS issues. Sometimes regional servers or cache servers are down. Sometimes clouds are having issues. There are scheduled maintenance windows and general user errors. It is definitely unreliable to check a link only once.
I don't want to repeat all the comments from previous BRFAs, but there are tons of exceptions that you have to monitor. I've had sites return 200 with a page-missing error just as often as returning 404 with valid content. I've had sites ignore me because of not having some expected user agent, allowing/denying cookies, having/not having a referrer, being from a certain region, not viewing ads, not loading scripts, not redirecting or redirecting to a wrong place or failing a redirect in scripts, HEAD and GET returning different results, and a hundred other things. — HELLKNOWZ ▎ TALK 18:03, 24 June 2015 (UTC) reply
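One common way to soften the single-check problem raised above is to require several failed probes spread out over time before a link is declared dead. The sketch below is only an illustration of that idea (the thresholds are arbitrary), not a description of how Cyberbot actually behaves:

```php
<?php
// Sketch: only treat a URL as dead after it has failed several checks
// spaced out over days, to ride out outages, DNS hiccups and maintenance.
// $history is a list of past checks, each ['time' => unix timestamp, 'alive' => bool].
function isConfirmedDead( array $history, $requiredFailures = 3, $minIntervalDays = 3 ) {
    $failures = 0;
    $lastFailure = 0;
    foreach ( $history as $check ) {
        if ( $check['alive'] ) {
            return false; // any successful check means it is not confirmed dead
        }
        if ( $check['time'] - $lastFailure >= $minIntervalDays * 86400 ) {
            $failures++;
            $lastFailure = $check['time'];
        }
    }
    return $failures >= $requiredFailures;
}
```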
If the bot gets it wrong, editors can add a {{cbignore}} tag to the citation to tell the bot to go away.— cyberpower Chat:Limited Access 18:46, 24 June 2015 (UTC) reply
Some citations carry |deadurl=no, which implies the link is not dead at this time. There was brief discussion on this (I can't really recall where), and sending a user to a slower, cached version when a live one is available was deemed "bad". I would say you need consensus for making archive links the default links when the bot has known detection errors and links may be live. It may be low enough that people don't care as long as there are archives for really dead links. — HELLKNOWZ ▎ TALK 19:35, 24 June 2015 (UTC) reply
{{cbignore}}.— cyberpower Chat:Online 20:03, 24 June 2015 (UTC) reply
"If the bot makes any changes to the page, a talk page notice is placed alerting the editors there that Cyberbot has tinkered with a ref." -- Is there consensus for this? That's a lot of messages. — HELLKNOWZ ▎ TALK 21:25, 24 June 2015 (UTC) reply
The main problem I see with this is automatically trying to identify whether a link is up or down. It's ridiculously tough for a bot to do it (reflinks had a ton of code for it), and IIRC sites like CNN and/or NYT blocked the toolserver in the past. I also don't see any advantage to using a special exclusion template and spamming talk pages. I also had written my own code for this ( BRFA) which I'll resuscitate. It'll be great to have multiple bots working on this! Legoktm ( talk) 22:22, 26 June 2015 (UTC) reply
The bot is exclusion compliant and will respect {{nobots|deny=InternetArchiveBot}}. As for checking whether a link is dead or not, there seems to be agreement among us to leave that feature off for now, or indefinitely. As for spamming talk pages, we can see how that works out. If it's too much after the trial, we can turn that off too.— cyberpower Chat:Online 23:16, 26 June 2015 (UTC) reply
What should we do here? -- Magioladitis ( talk) 13:54, 28 June 2015 (UTC) reply
Approved for trial (100 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. ·addshore· talk to me! 15:52, 28 June 2015 (UTC) reply
That's all from me - only the first four things are potentially problematic. Jo-Jo Eumerus ( talk) 18:53, 28 June 2015 (UTC) reply
Approved for extended trial (300 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. 500 is too much for us to check. Let's do 300 first. -- Magioladitis ( talk) 08:09, 5 July 2015 (UTC) reply
I've only done tests on a few articles to find the worst bugs. These are not all of them, only those I found when checking a selection of articles to gauge its reliability. Saying "the bot can't know if it is dead on Wayback" is not a good excuse; that is a reason not to allow the bot task.
Legend for recurring errors:

| Code | Error |
|---|---|
| (b) | The source URL was not dead. |
| (c) | The archive-url is dead. |
| Diff | URL | Archived | Note |
|---|---|---|---|
| [4] | [5] | [6] | (c) THE BOT REPEATED THE EDIT AFTER BEING REVERTED |
| [7] | [8] | [9] | (c) THE BOT REPEATED THE EDIT AFTER BEING REVERTED |
| [10] | [11] | [12] | (c) THE LINK WAS INLINE, NOT IN-REF OR UNDER EXTERNAL LINKS |
| [13] | [14] | [15] | (c) |
| [16] | [17] | [18] | (c) |
| [19] | - | - | Added \|dead-url=yes even though \|deadurl=yes already existed |
| [20] | [21] | [22] | (c) |
| ^ | [23] | [24] | (b) |
| ^ | [25] | [26] | (c) |
| [27] | [28] | [29] | (c) |
| [30] | [31] | [32] | (b) |
| [33] | - | - | REMOVED CONTENT FROM THE ARTICLE |
| [34] | - | - | TRIED TO FIX A STRAY REF IN COMMENTED TEXT, BREAKING A TEMPLATE AND REMOVING CONTENT |
| [35] | - | - | REMOVED CONTENT FROM THE ARTICLE |
( t) Josve05a ( c) 16:18, 5 July 2015 (UTC) reply
{{OperatorAssistanceNeeded|D}} Magioladitis ( talk) 22:46, 7 July 2015 (UTC) reply
The bot appears to be ready for one last trial before approval.— cyberpower Chat:Offline 06:04, 11 July 2015 (UTC) reply
The Earwig here you are! 500 pages :) Josve05a you may also want to have a look! -- Magioladitis ( talk) 08:53, 15 July 2015 (UTC) reply
Josve05a did you have the chance to check (some of) the 500 edits? -- Magioladitis ( talk) 14:32, 24 July 2015 (UTC) reply
{{{manually_checked}}} or something is more "accurate", but it sounds like a plan. Has my 'vote'. ( t) Josve05a ( c) 19:16, 6 August 2015 (UTC) reply
{{{checked}}}?— cyberpower Chat:Online 19:53, 6 August 2015 (UTC) reply
Here's an example: {{sourcecheck}}
{{BAGAssistanceNeeded}} I recommend that this bot task be approved, on the condition that the template above is implemented. In case a bug arises which breaks a page or changes the page layout in any way, the bot shall be turned off and not turned on again until the bug has been fixed, in order not to break more pages. This should not be conditional. ( t) Josve05a ( c) 03:23, 8 August 2015 (UTC) reply
The messages would carry |checked=false. This might be a problem given the categorization. Also, I'm not sure if requiring (or recommending, at the very least) manual intervention on over a hundred thousand talk pages is a good idea.
Approved. Cyberpower has removed the message regarding un-archivable links. To the best of my knowledge, that was the only remaining issue. Future feature requests, such as detecting unmarked dead links, should be made under a subsequent BRFA. — Earwig talk 00:24, 25 August 2015 (UTC) reply