Operator: Tim1357 ( talk · contribs)
Automatic or Manually assisted: Automatic
Programming language(s): Python
Source code available: I need to figure out SVN
Function overview: Find suitable archived copies for dead links on the Internet Archive
Links to relevant discussions (where appropriate): I have to find them all, but they're there.
Edit period(s): Every Night
Estimated number of pages affected: N/A
Exclusion compliant (Y/N): Yes
Already has a bot flag (Y/N): Yes
Function details:
<ref> tags. |accessdate. If so, skip to step 5.
* Some pages return 404 on the first try because their disks are spinning up.
** I have asked for permission to query WikiBlame; I am waiting for a reply.
*** The people at the Internet Archive told me I could do this, provided I use an identifiable user-agent (with an email address and such)
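A minimal sketch of how those two points might look in Python (the endpoint, bot name, and contact address here are illustrative assumptions, not the bot's actual code): each Wayback query carries an identifiable User-Agent with contact details, as the Internet Archive asked, and a first-try 404 is retried once because the Archive's disks may still be spinning up.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

# Illustrative values -- the real bot's endpoint, name, and contact differ.
WAYBACK_API = "https://archive.org/wayback/available"
USER_AGENT = "DeadLinkBot/0.1 (https://en.wikipedia.org/wiki/User:Tim1357; operator@example.org)"

def build_request(url, timestamp=None):
    """Build an availability query that identifies the bot to the
    Internet Archive (user-agent with a contact email)."""
    query = {"url": url}
    if timestamp:
        query["timestamp"] = timestamp  # YYYYMMDD hint for the closest snapshot
    req = urllib.request.Request(WAYBACK_API + "?" + urllib.parse.urlencode(query))
    req.add_header("User-Agent", USER_AGENT)
    return req

def fetch_with_retry(req, attempts=2, delay=5.0):
    """Fetch, retrying once on 404: the first hit can fail while the
    Archive's disks spin up."""
    for attempt in range(attempts):
        try:
            return urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code != 404 or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

The retry count and delay are guesses; the important part is distinguishing a transient first-try 404 from a genuinely missing snapshot.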
I did testing (under my own account) in my user space, and did one small run in the real world to make sure everything worked. Tim 1357 talk 02:37, 3 May 2010 (UTC) reply
Needs to be more than a day that you wait. More like a week or so. So you'll need to store the dead URLs for that period of time. Shouldn't be too difficult. -- MZMcBride ( talk) 04:20, 3 May 2010 (UTC) reply
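One way that week-long wait could be implemented, sketched with a small SQLite table (the table and column names are made up for illustration): record when a URL first failed, and only treat it as dead once it has kept failing for seven days.

```python
import sqlite3
import time

WAIT_SECONDS = 7 * 24 * 3600  # roughly the week MZMcBride suggests

def confirmed_dead(conn, url, now=None):
    """Record a failing URL and return True only once it has been
    failing for at least a week; fresh failures just get stored."""
    now = time.time() if now is None else now
    conn.execute("CREATE TABLE IF NOT EXISTS dead_urls"
                 " (url TEXT PRIMARY KEY, first_seen REAL)")
    conn.execute("INSERT OR IGNORE INTO dead_urls VALUES (?, ?)", (url, now))
    (first_seen,) = conn.execute(
        "SELECT first_seen FROM dead_urls WHERE url = ?", (url,)).fetchone()
    return now - first_seen >= WAIT_SECONDS
```

Each nightly run would call this for every dead link it finds; a URL that comes back alive in the meantime should be deleted from the table (not shown here).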
I'm sure that you will work to improve efficiency when you get a handle on where the bottlenecks are. Do you have a way of measuring where the bottlenecks are?
Have you figured out Subversion yet? Try http://svnbook.red-bean.com/en/1.0/svn-book.html
What technique will you use to select the pages to operate on? Do you have a target edit rate for the bot? Josh Parris 09:49, 10 May 2010 (UTC) reply
Also, to reduce false dead-link tripping on momentary server downtime, you might consider checking the Google cache for its timestamp if not its content. LeadSongDog come howl! 17:09, 13 May 2010 (UTC) reply
should not be replaced with: Foo bar, baz, spam spam spam lorem foo.
Foo bar, baz, spam spam spam Archived May 27, 2009 (Timestamp length), at the Wayback Machine lorem foo
What will happen when run against Why Is Sex Fun? Josh Parris 09:59, 17 May 2010 (UTC) reply
Is Wayback's star "*" notation for changed revisions reliable enough to use links outside the 6-month window? — Hellknowz ▎ talk 15:55, 18 May 2010 (UTC) reply
Approved for trial (25 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Let's see the bot in action on a larger sample set. Josh Parris 02:41, 18 May 2010 (UTC) reply
Perhaps the thing I struggled with most was determining the accessdate of a URL. For that reason, I thought it'd be nice to expand on how I go about determining the date of insertion.
However, if there is no accessdate associated with the URL, the bot scans the article's recent history (up to 1000 revisions) to find the closest date of insertion. Tim 1357 talk 04:19, 31 May 2010 (UTC) reply
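The history scan Tim describes might look something like this sketch (the function and variable names are mine, not the bot's): walk the revisions from newest to oldest and take the timestamp of the oldest revision that still contains the URL.

```python
def apparent_insertion_date(revisions, url):
    """revisions: (timestamp, wikitext) pairs, newest first, up to 1000.
    Returns the timestamp of the oldest scanned revision that already
    contains the URL -- the closest available guess at its access date --
    or None if the URL is not in the current text at all."""
    inserted = None
    for timestamp, wikitext in revisions:   # newest -> oldest
        if url in wikitext:
            inserted = timestamp            # URL still present this far back
        else:
            break                           # revision predates the URL: stop
    return inserted
```

A simple substring check is enough for a sketch; the real bot would also have to cope with the URL being removed and re-added between revisions.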
the URL linked in this edit to England national football team manager is a 404 (sorta). Do you have a mechanism to check if any of the other edits linked to not-helpful archives like this one? Josh Parris 08:25, 31 May 2010 (UTC) reply
This edit claims genfixes; none are made. Josh Parris 08:33, 31 May 2010 (UTC) reply
This edit doesn't mention marking dead links; perhaps Found archives for 5 of 17 dead links? Josh Parris 09:06, 31 May 2010 (UTC) reply
As a general comment, it would be nice if the bot could explain a bit more in the summary, maybe give a link to the task description page. — Hellknowz ▎ talk 15:49, 31 May 2010 (UTC) reply
Another note, undated references added in the very first revision have a high chance of being copied/split from another article. This means the addition date is not the access date. For example, 2007 suicide bombings in Iraq, first revision. — Hellknowz ▎ talk 15:01, 1 June 2010 (UTC) reply
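One way to honor that caveat, sketched under the same assumptions as the history scan above (names are mine, not the bot's): if the URL is present in every revision scanned, all the way back to the article's first, refuse to infer an access date, since the reference may have been copied or split in from another article.

```python
def trusted_insertion_date(revisions, url):
    """revisions: (timestamp, wikitext) pairs, newest first, ending at the
    article's first revision. Returns the insertion timestamp, or None
    when the URL already appears in the first revision (likely copied or
    split from another article, so the addition date is not the access date)."""
    inserted = None
    for timestamp, wikitext in revisions:   # newest -> oldest
        if url in wikitext:
            inserted = timestamp
        else:
            return inserted                 # revision just before insertion found
    return None                             # present all the way back: don't trust it
```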
Are there any other concerns that I have not met? Tim 1357 talk 02:11, 3 June 2010 (UTC) reply
Approved. Good, good… go break a leg! — The Earwig (talk) 21:04, 9 June 2010 (UTC) reply