Operator: Rotlink (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 21:04, Sunday August 18, 2013 (UTC)
Automatic, Supervised, or Manual: Supervised
Programming language(s): Scala, Sweble, Wiki.java
Source code available: No. There are only a few lines of code (and half of them are wiki URLs and passwords), because two powerful frameworks do all the work.
Function overview: Find dead links (mostly by looking for {{dead link}} marks next to them) and try to recover them by searching web archives using the Memento protocol.
Links to relevant discussions (where appropriate): User_talk:RotlinkBot, Wikipedia:Bot_owners'_noticeboard#RotlinkBot_approved.3F
Edit period(s): Daily
Estimated number of pages affected: 1000/day (perhaps a bit more in the first few days)
Exclusion compliant (Yes/No): No. It was not exclusion compliant initially, and so far nobody has undone any change or made complaints against it. It can easily be made exclusion compliant.
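For illustration, the standard {{bots}}/{{nobots}} exclusion-compliance convention is simple to implement. This is a minimal Python sketch (the bot itself is written in Scala; the function and variable names here are invented for the example):

```python
import re

def excluded(wikitext: str, botname: str = "RotlinkBot") -> bool:
    """Return True if the page opts out of edits by this bot,
    per the {{nobots}} / {{bots|deny=...}} / {{bots|allow=...}} convention."""
    # {{nobots}} bans all bots outright.
    if re.search(r"\{\{\s*nobots\s*\}\}", wikitext, re.I):
        return True
    # {{bots|deny=...}} lists bots that must stay away.
    m = re.search(r"\{\{\s*bots\s*\|\s*deny\s*=\s*([^}]*)\}\}", wikitext, re.I)
    if m:
        denied = [n.strip().lower() for n in m.group(1).split(",")]
        return "all" in denied or botname.lower() in denied
    # {{bots|allow=...}} lists the only bots permitted.
    m = re.search(r"\{\{\s*bots\s*\|\s*allow\s*=\s*([^}]*)\}\}", wikitext, re.I)
    if m:
        allowed = [n.strip().lower() for n in m.group(1).split(",")]
        return not ("all" in allowed or botname.lower() in allowed)
    return False
```

The real template syntax has more variants (see Template:Bots), but this covers the common cases.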
Already has a bot flag (Yes/No): No
Function details: Find dead links (mostly by looking for {{dead link}} marks next to them) and try to recover them by searching web archives using the Memento protocol.
The current version of the bot software does not work with other, non-Memento-compatible archives (WebCite, WikiWix, Archive.pt, ...).
During the test run, about 3/4 of recovered links were found on the Internet Archive (because it has the biggest and oldest database), about 1/4 on Archive.is (because of its proactive archiving of new links appearing on the wikis), and only a few links on the other archives (because of their smaller size and regional focus).
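For context on the lookup itself: under the Memento protocol (RFC 7089), a client asks an archive's TimeGate for the snapshot nearest a desired datetime via the Accept-Datetime request header. A minimal illustrative Python sketch (the bot itself is in Scala), using the Internet Archive's TimeGate endpoint as seen in the TimeMap output later on this page:

```python
import urllib.request
from datetime import datetime

def timegate_request(original_url: str, when: datetime) -> urllib.request.Request:
    """Build a Memento TimeGate request (RFC 7089). The archive answers
    with a redirect to the snapshot closest to Accept-Datetime."""
    # RFC 1123 date format, as required for Accept-Datetime.
    accept_dt = when.strftime("%a, %d %b %Y %H:%M:%S GMT")
    return urllib.request.Request(
        "http://web.archive.org/web/" + original_url,
        headers={"Accept-Datetime": accept_dt},
    )
```

Sending the request with urllib.request.urlopen() and following the redirect yields the concrete snapshot URL.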
I will consider the many edits already made as a long trial. I'll look through more edits as time permits.
One immediate issue is that the bot needs a much better edit summary, with a link to a page explaining the bot's process.
For more info on how this was handled before, see H3llBot, DASHBot, and BlevintronBot and their talk pages, so that you are aware of the many issues that can come up.
Just to clarify, does "supervised" mean you will review every edit the bot makes? That is what it means here. If so, I can be more lenient on corner cases, as you will manually fix them all. ~~Hellknowz
For example, how do you determine that links are actually dead? What exactly does "mostly by looking for {{dead link}}" cover? In particular, there is consensus that a link should be revisited at least twice to confirm it is not just server downtime or an incorrect temporary 404 message. This has been an issue, as there are false positives and many corner cases. ~~Hellknowz
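One common approach to the "revisit at least twice" requirement is to probe the link several times, spaced apart, and count it as dead only when every probe fails and at least one failure looks permanent (a plain 404/410 rather than a 5xx hiccup). An illustrative Python sketch (the bot is in Scala; these helper names are invented):

```python
import urllib.request
import urllib.error

# Status codes that may be temporary server trouble; 0 = network failure.
TRANSIENT = {0, 429, 500, 502, 503, 504}

def probe(url: str, timeout: int = 30) -> int:
    """Return the HTTP status code for url, or 0 on network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as r:
            return r.status
    except urllib.error.HTTPError as e:
        return e.code
    except (urllib.error.URLError, OSError):
        return 0

def confirmed_dead(statuses) -> bool:
    """Given statuses from probes made at separate times, treat the link
    as dead only if no probe succeeded and at least one failure is
    non-transient."""
    if any(200 <= s < 400 for s in statuses):
        return False
    return any(s not in TRANSIENT for s in statuses)
```

In practice the probes would be separated by hours or days, not seconds, to rule out short outages.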
One issue is that the bot does not respect existing date formats [3]. |archivedate= should be consistent, usually with |accessdate=, and definitely if other archive dates are present. They are exempt from {{use dmy dates}} and {{use mdy dates}}, although date format is a contentious issue. ~~Hellknowz
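Matching the existing format can be done by sniffing the style of the citation's |accessdate= and emitting |archivedate= in the same style. An illustrative Python sketch covering the three common styles (dmy, mdy, ISO); this is not the bot's actual code:

```python
import re
from datetime import date

def match_date_format(sample: str, d: date) -> str:
    """Format d in the same style as sample (e.g. an existing
    |accessdate= value). Falls back to ISO if the style is unrecognized."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", sample):
        return d.isoformat()                             # 2013-08-18
    if re.fullmatch(r"\d{1,2} [A-Z][a-z]+ \d{4}", sample):
        return f"{d.day} {d.strftime('%B')} {d.year}"    # 18 August 2013 (dmy)
    if re.fullmatch(r"[A-Z][a-z]+ \d{1,2}, \d{4}", sample):
        return f"{d.strftime('%B')} {d.day}, {d.year}"   # August 18, 2013 (mdy)
    return d.isoformat()
```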
|archivedate= to the templates which already have |accessdate=, as in the example. ~~Rotlink

Further, {{Wayback}} (or similar) is the preferred way of archiving bare external links; you should not just replace the original url without any extra info [4], as this creates extra issues: the url now permanently points to the archive instead of the original. You can create other service-specific templates for your needs, probably for Archive.is. ~~Hellknowz
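Wrapping the snapshot in a template keeps the original url visible in the wikitext. A sketch of building a {{Wayback}} call from an archive.org snapshot URL (illustrative Python, not the bot's code; the parameter names |url= and |date= should be verified against the template's documentation):

```python
import re

def to_wayback_template(archive_url: str) -> str:
    """Turn a web.archive.org snapshot URL into a {{Wayback}} call that
    preserves the original url instead of silently replacing it."""
    m = re.match(r"https?://web\.archive\.org/web/(\d{14})/(.+)", archive_url)
    if not m:
        raise ValueError("not a Wayback snapshot URL")
    stamp, original = m.groups()
    return f"{{{{Wayback|url={original}|date={stamp}}}}}"
```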
Here [6] you change an external url to a reference, which runs afoul of WP:CITEVAR; bots should never change citation styles (unless that is their task). I'm guessing this is because the url sometimes becomes mangled, as per the previous paragraph? ~~Hellknowz
<ref>Peter A. [http://www.weather.com/day.html ''What a day!''] 2002. Weather Publishing.</ref> is a valid reference using manual citation style. <ref>{{cite web |author=Peter A. |url=http://www.weather.com/day.html |title=What a day! |year=2002 |publisher=Weather Publishing.}}</ref> is a valid reference using CS1.

There are other potential issues, like a {{Wayback}} template already next to the citation, various possible locations of {{dead link}} (inside or outside the ref), and archive parameters already present in citations or partially missing. ~~Hellknowz
To clarify, does the bot use |accessdate=, then |date=, for deciding what date the page snapshot should come from? If no date is specified, does the bot actually parse the page history to find when the link was added and thus accessed? This is how previous bot(s) handled this. Unless consensus changes, we can't yet assume any date/copy will suffice. ~~Hellknowz
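The date priority described here can be sketched as a fallback chain plus nearest-snapshot selection (illustrative Python; the function names are invented, and mementos are assumed to be (datetime, url) pairs):

```python
from datetime import datetime

def target_datetime(accessdate, date, first_seen) -> datetime:
    """Date the snapshot should approximate: |accessdate= first, then
    |date=, then the revision in the page history where the link first
    appeared (first_seen). Each argument is a datetime or None."""
    for candidate in (accessdate, date, first_seen):
        if candidate is not None:
            return candidate
    raise ValueError("no usable date for this citation")

def pick_snapshot(mementos, target: datetime):
    """From a list of (datetime, url) mementos, pick the one closest in
    time to target."""
    return min(mementos, key=lambda m: abs((m[0] - target).total_seconds()))
```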
The bot does need to be exclusion compliant due to the nature of the task and the number of pages edited. You should also respect {{ inuse}} templates, although that's secondary. ~~Hellknowz
Can you please give more details on how Memento actually retrieves the archived copy? What guarantees are there that it is a match, what are their time ranges? I am going through their specs, but it is important that you yourself clarify enough detail for the BRFA, as we cannot easily approve a bot solely on third-party specs that may change. While technically an outside project, you are fully responsible for the correctness of the change. ~~Hellknowz
C:\>curl -i "http://web.archive.org/web/timemap/http://www.reuben.org/NewEngland/news.html"
HTTP/1.1 200 OK
Server: Tengine/1.4.6
Date: Mon, 19 Aug 2013 12:18:06 GMT
Content-Type: application/link-format
Transfer-Encoding: chunked
Connection: keep-alive
set-cookie: wayback_server=36; Domain=archive.org; Path=/; Expires=Wed, 18-Sep-13 12:18:06 GMT;
X-Archive-Wayback-Perf: [IndexLoad: 140, IndexQueryTotal: 140, , RobotsFetchTotal: 0, , RobotsRedis: 0, RobotsTotal: 0, Total: 144, ]
X-Archive-Playback: 0
X-Page-Cache: MISS

<http:///www.reuben.org/NewEngland/news.html>; rel="original",
<http://web.archive.org/web/timemap/link/http:///www.reuben.org/NewEngland/news.html>; rel="self"; type="application/link-format"; from="Wed, 13 Nov 2002 22:19:28 GMT"; until="Thu, 10 Feb 2005 17:57:37 GMT",
<http://web.archive.org/web/http:///www.reuben.org/NewEngland/news.html>; rel="timegate",
<http://web.archive.org/web/20021113221928/http://www.reuben.org/NewEngland/news.html>; rel="first memento"; datetime="Wed, 13 Nov 2002 22:19:28 GMT",
<http://web.archive.org/web/20021212233113/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Thu, 12 Dec 2002 23:31:13 GMT",
<http://web.archive.org/web/20030130034640/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Thu, 30 Jan 2003 03:46:40 GMT",
<http://web.archive.org/web/20030322113257/http://reuben.org/newengland/news.html>; rel="memento"; datetime="Sat, 22 Mar 2003 11:32:57 GMT",
<http://web.archive.org/web/20030325210902/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Tue, 25 Mar 2003 21:09:02 GMT",
<http://web.archive.org/web/20030903030855/http://reuben.org/newengland/news.html>; rel="memento"; datetime="Wed, 03 Sep 2003 03:08:55 GMT",
<http://web.archive.org/web/20040107081335/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Wed, 07 Jan 2004 08:13:35 GMT",
<http://web.archive.org/web/20040319134618/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Fri, 19 Mar 2004 13:46:18 GMT",
<http://web.archive.org/web/20040704184155/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Sun, 04 Jul 2004 18:41:55 GMT",
<http://web.archive.org/web/20040904163424/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Sat, 04 Sep 2004 16:34:24 GMT",
<http://web.archive.org/web/20041027085716/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Wed, 27 Oct 2004 08:57:16 GMT",
<http://web.archive.org/web/20050116115009/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Sun, 16 Jan 2005 11:50:09 GMT",
<http://web.archive.org/web/20050210175737/http://www.reuben.org/NewEngland/news.html>; rel="last memento"; datetime="Thu, 10 Feb 2005 17:57:37 GMT"
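The TimeMap body above is application/link-format (RFC 7089 / RFC 5988), which is straightforward to parse. An illustrative Python sketch that extracts the memento entries (snapshot url plus datetime) from such a body; this is not the bot's actual parser:

```python
import re
from datetime import datetime

def parse_timemap(body: str):
    """Extract (datetime, url) pairs for entries whose rel value contains
    'memento' (this also catches 'first memento' / 'last memento') from an
    application/link-format TimeMap body."""
    out = []
    for m in re.finditer(
            r'<([^>]+)>;\s*rel="([^"]*memento[^"]*)";\s*datetime="([^"]+)"',
            body):
        url, rel, dt = m.groups()
        # Dates are RFC 1123, e.g. "Wed, 13 Nov 2002 22:19:28 GMT".
        out.append((datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S %Z"), url))
    return out
```

The rel="original", rel="self", and rel="timegate" entries carry no datetime and are deliberately skipped.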
Finally, what about replacing Google cache links with archive links? Do you intend to submit another BRFA for this? — HELLKNOWZ ▎ TALK 09:34, 19 August 2013 (UTC)