Automatic or Manually Assisted: (Mostly) Automatic, supervised
Programming Language(s): Perl
Function Summary: To correct dead links due to link rot
Edit period(s) (e.g. Continuous, daily, one time run): as needed
Already has a bot flag (Y/N): N
Function Details: DeadLinkBOT's purpose is to update links that are invalid due to link rot. The first version of the program will simply replace all instances of a user-supplied link with an updated link (i.e. a change pre-approved by me). When needed, the program is capable of making simple determinations about the nature of the WP page in order to pick a new link from a list of alternatives (given user-supplied rules). In the future, the program will be expanded to actively seek out updated links after retrieving a list of dead links to be updated. These more advanced changes will require user confirmation. When a page is edited to update a link, the bot will also apply AWB-like general fixes.
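To make the "pre-approved change" idea concrete, a minimal sketch of this kind of rule-driven replacement in Perl might look like the following. The rule structure, field names, and URLs here are illustrative assumptions, not the bot's actual rule format:

 # Sketch only: apply one pre-approved replacement rule to a page's wikitext.
 # The rule layout below is hypothetical; the real bot's format may differ.
 use strict;
 use warnings;

 my $rule = {
     old_url      => 'http://www.example.org/old/page.html',      # dead link (hypothetical)
     new_url      => 'http://www.example.org/archive/page.html',  # replacement (hypothetical)
     must_contain => 'Example Society',   # optional sanity check on the page text
 };

 sub apply_rule {
     my ($wikitext, $rule) = @_;
     # Optional "simple determination": only edit pages containing the expected text.
     return $wikitext if defined $rule->{must_contain}
         && index($wikitext, $rule->{must_contain}) < 0;
     my ($old, $new) = ($rule->{old_url}, $rule->{new_url});
     $wikitext =~ s/\Q$old\E/$new/g;   # replace every instance of the old URL
     return $wikitext;
 }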
I have tested the bot with local writes and all works according to plan. I would appreciate it if a trial could be approved for actual wiki editing soon. Thanks. -- ThaddeusB ( talk) 02:34, 14 December 2008 (UTC) reply
Can a member of BAG please explain exactly what they want me to do to prove this bot works correctly? I've tested it locally, answered every question here, released the source, and tried to be patient but no one seems to be willing to act. What do I need to do to get the ball rolling? -- ThaddeusB ( talk) 02:34, 18 December 2008 (UTC) reply
I would like to see the bot split into two parts: a read-only bot that identifies items that need replacing, and a change-bot that works off the generated list, with the caution that the change-bot would only make changes if the text to be changed and its immediately surrounding text hadn't been edited in the meantime. There are at least three good reasons for this:
Once the two bots are working nicely separately, they can be interleaved, so as an item is added to the list, it is immediately processed and the edit is made. davidwr/( talk)/( contribs)/( e-mail) 19:50, 18 December 2008 (UTC) reply
There is already a bot request for which this bot would be useful:
Wikipedia:Bot_requests#Bulk-replace_URL_for_Handbook_of_Texas_Online
davidwr/( talk)/( contribs)/( e-mail) 19:50, 18 December 2008 (UTC) reply
Find Links (this part is not yet written, but, as you say, doesn't actually require approval since it doesn't do any wiki writing)
- Get a dead URL from a Wikipedia:Link_rot subpage
- Find the last good version of said URL using archive.org or a search engine cache (a rough lookup sketch follows this outline)
- Check whether it was an ad page set up by a domain squatter; if so, find an older version
- See if the last version mentions a site move; if so, check the move URL to make sure it is good & that the content matches
- If no move URL is found, perform a search engine search using blocks of text from the last good page to try to find where it moved
- Write recommended changes to a file for review
Wait a week to ensure the URL is indeed dead
Alternatively, if a user (such as yourself) supplies a URL that needs to be changed, the URL can go directly into the "for review" stack
URL reviewed by me to make sure the recommended change is accurate, then it is moved into a machine-readable file to be processed
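For the archive.org step above, one possible lookup is the Wayback Machine availability API; the code below is a minimal sketch under that assumption, not necessarily how the bot will implement the step:

 # Sketch: look up the most recent archived copy of a dead URL via the
 # Wayback Machine availability API (an assumed approach for illustration).
 use strict;
 use warnings;
 use LWP::UserAgent;
 use JSON;
 use URI::Escape;

 sub last_archived_copy {
     my ($dead_url) = @_;
     my $ua  = LWP::UserAgent->new(agent => 'DeadLinkBOT-sketch/0.1');
     my $api = 'http://archive.org/wayback/available?url=' . uri_escape($dead_url);
     my $res = $ua->get($api);
     return unless $res->is_success;
     my $data    = decode_json($res->decoded_content);
     my $closest = $data->{archived_snapshots}{closest} or return;
     return unless $closest->{available};
     return ($closest->{url}, $closest->{timestamp});   # archived URL + YYYYMMDDhhmmss
 }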
Processing
- Get URL + change(s) from file; a change can require a simple test, such as making sure "text" is in the page to be changed, and make decisions based on those tests; the new text can be anything - presumably a URL or a template.
- Use Special:LinkSearch to find all instances of the URL on Wikipedia - this list could be output to a file if you want (a query sketch follows this list)
- Make changes using Perl's s/// operator; scope can be limited, if desired (to article space only, for example; archives are always excluded). Alternatively, I could add a simple check to make sure both the old & new URLs don't appear on the same page - that should remove any false positives - and write those cases to a file for manual review.
- Stop every so often for review by me to make sure everything is working OK.
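As a rough illustration of the Special:LinkSearch step, the same list can be retrieved through the MediaWiki API's list=exturlusage module. The sketch below is an assumed query for illustration, not the bot's actual code:

 # Sketch: find pages linking to a given URL (the API equivalent of
 # Special:LinkSearch). eunamespace=0 mirrors the article-space-only option above.
 use strict;
 use warnings;
 use LWP::UserAgent;
 use JSON;
 use URI::Escape;

 my $api = 'https://en.wikipedia.org/w/api.php';
 my $ua  = LWP::UserAgent->new(agent => 'DeadLinkBOT-sketch/0.1');

 sub pages_linking_to {
     my ($url_fragment) = @_;   # e.g. 'www.example.org' (hypothetical)
     my $res = $ua->get("$api?action=query&list=exturlusage"
                      . "&euquery=" . uri_escape($url_fragment)
                      . "&eunamespace=0&eulimit=500&format=json");
     die $res->status_line unless $res->is_success;
     my $data = decode_json($res->decoded_content);
     return map { $_->{title} } @{ $data->{query}{exturlusage} || [] };
 }

 # For each returned title: fetch the wikitext, run the substitution, and skip
 # the page if nothing changed or if both old and new URLs already appear
 # (a possible false positive, written to a file for manual review).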
Approved for trial (100 edits or 8 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — Reedy 22:10, 18 December 2008 (UTC) reply
Why not just use the appropriate options (basetimestamp and starttimestamp) to the API edit command to detect edit conflicts the normal way, instead of trying to do some odd "possibly overwrite others' edits, and then try to self-revert" scheme? Anomie ⚔ 23:59, 18 December 2008 (UTC) reply
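For reference, a minimal sketch of the edit call Anomie describes might look like the following. The helper structure and parameter plumbing are assumptions; basetimestamp and starttimestamp are the actual API parameters that let the server reject a save when the page was edited in the meantime (token handling is omitted for brevity):

 # Sketch: save a page so that the API itself reports an edit conflict
 # instead of the bot overwriting intervening edits.
 use strict;
 use warnings;
 use LWP::UserAgent;
 use JSON;

 sub save_page {
     my ($ua, $api, $title, $new_text, $basetimestamp, $starttimestamp, $token) = @_;
     my $res = $ua->post($api, {
         action         => 'edit',
         title          => $title,
         text           => $new_text,
         summary        => 'Updating dead link',
         basetimestamp  => $basetimestamp,   # timestamp of the revision the bot read
         starttimestamp => $starttimestamp,  # time the bot began fetching the page
         token          => $token,
         format         => 'json',
     });
     my $data = decode_json($res->decoded_content);
     if ($data->{error} && $data->{error}{code} eq 'editconflict') {
         return 0;   # someone edited in the meantime; re-fetch and retry
     }
     return 1;
 }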
Trial complete.
I rewrote the program to query the API directly (rather than using perlwikipedia.pm or an equivalent). This enabled more efficient resource usage and the ability to correctly detect edit conflicts. However, it did lead to some temporary bugs. Most embarrassingly, the bot's first 5 edits blanked pages due to a mistyped variable name. (Doh!) Of course, I promptly fixed any errors the bot made and corrected the code to avoid repeating them. :)
The bot can now detect edit conflicts and false positives (e.g. on talk pages), although neither arose in the trial period. It ignores pages in the Wikipedia: namespace (except WikiProject pages), archives, sandboxes, and pages in its own userspace.
After everything was working, DeadLinkBOT made just under 50 edits correcting angeltown.com links. A log of these edits can be found at User:DeadLinkBOT/Logs/AngelTowns.log. I have manually reviewed them and have also invited Kittybrewster to review and comment here.
Here is a representative sample of the kinds of corrections it can routinely make:
Collectively, these edits represent both the typical workload of the bot (straight URL replacement) and the most complicated case that will regularly arise (transition to a simple template). I am confident that the bot will be 99%+ accurate with these edits.
During the trial, I used the other 50 approved edits to parse a much more difficult situation than the bot would typically face - transitioning a dead URL to a complicated template (Handbook of Texas). This change uses a custom function that no other changes will use, so its accuracy is independent of the normal functional accuracy. Since the parsing is fairly complex, I have had to make several changes to it, so the edit history ( User:DeadLinkBOT/Logs/HandbookOfTexas.log) is not completely representative of the current functionality. In particular, the bot made several errors that it would no longer make. (All changes were manually verified and corrected when needed.) The bot should be much closer to fully accurate now, but all changes will be manually verified for the foreseeable future.
Here is a representative sample of the kinds of corrections it can make:
Again, these edits are not typical but rather representative of the most complicated edits the bot would ever do. If the need for this sort of change ever arises again, I would of course be manually verifying everything again. I have explained my methodology to davidwr and invited him to comment here. -- ThaddeusB ( talk) 07:14, 24 December 2008 (UTC) reply
Statements like "99%+ accurate" are meaningless, since there'll always be a user who does the unexpected and sets examples for others to follow. Anyway, I'm the author and maintainer of the Checklinks tool and PDFbot. Checklinks detects, lists, and allows user repairs of dead links on pages; it is mostly used as a link checker during article review. PDFbot had been approved for similar dead link repair; however, it actually checks every link it replaces to make sure it works.
So here are some of the caveats:
That is all I can think of at the moment. If this bot is approved, could it simply change nytimes links from [13] to [14] to remove the login requirements? — Dispenser 18:47, 31 December 2008 (UTC) reply
(?<!\w://[^][<>\s"]*) and (?<=[\s!?.,'"}|>*]) should be functionally equivalent.
Any chance of getting this approved soon? The trial ended almost 2 weeks ago and I have addressed all the concerns raised. I'd like to get started on fixing up more dead links soon. Thanks. -- ThaddeusB ( talk) 23:45, 4 January 2009 (UTC) reply
{{ BAGAssistanceNeeded}} -- Tinucherian 12:04, 8 January 2009 (UTC) reply
I've got a few more user-submitted link updates to work on; there are now >5000 links waiting to be updated. Hoping to get started soon, ThaddeusB ( talk) 13:26, 8 January 2009 (UTC) reply