Automatic or Manually Assisted: Both. I choose at every run.
Programming Language(s): Python. pywikipedia framework
Function Summary: Idea from Wikipedia:Bot requests#Incorrect Ref Syntax
Edit period(s) (e.g. Continuous, daily, one time run): Every time that a new dump is available.
Edit rate requested: I don't know; standard pywikipedia throttle settings. Edit: from my test runs on fr:, from 5 to 10 edits per minute
Already has a bot flag (Y/N): N.
Function Details: Read User:DumZiBoT/refLinks. Please feel free to correct English mistakes if you find any, it is intended to be a runtime FAQ ;)
The script has been manually tested on fr, where I already have a bot flag for fr:Utilisateur:DumZiBoT and ~40k automated edits. From the ~20 edits that I've made over there, I've found several exceptions, which are fixed by now.
Sidenotes:
Do you have an estimate of the total number of pages to be edited in the first full run on enwiki? Does your parser for the dumps match things with extra spaces, such as
<ref> [http://google.com ] </ref>
? — Carl ( CBM · talk) 21:50, 29 December 2007 (UTC)
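A tolerant pattern for such bare-link refs could look like the following (a sketch only; the bot's actual regex lives in reflinks.py, and the name LINK_RE here is illustrative):

```python
import re

# Hypothetical sketch: match <ref> tags containing only a bare bracketed
# external link, tolerating extra whitespace inside the brackets and tags.
LINK_RE = re.compile(
    r'<ref[^>]*>\s*\[\s*(?P<url>https?://[^\s\]]+)\s*\]\s*</ref>',
    re.IGNORECASE)

m = LINK_RE.search('<ref> [http://google.com ] </ref>')
print(m.group('url'))  # http://google.com
```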
I did another run on fr, longer this time: [1]. I had some rare encoding problems ([2]) which I need to fix, but everything else seems fine to me. NicDumZ ~ 00:07, 30 December 2007 (UTC)
I was thinking about doing something similar to this for a while, except my bot would have been tailored to placing {{ cite news}} with most parameters (date, title, author) filled in from the website. Any similar ideas? And will you be placing a comment notifying editors that the title was automatically added (so they aren't left wondering why some of the links have strange titles)? I look forward to seeing your source code. — Dispenser ( talk) 08:47, 30 December 2007 (UTC)
An accessdate parameter could be stated (how would you retrieve an author parameter?). Is that worth it? Also, <!-- [http://example.org] --> gets turned into <!-- [http://example.org <!-- bot generated title -->title] -->, which yields title] -->. — Dispenser ( talk) 20:02, 30 December 2007 (UTC)
I would definitely be interested to read through the code. I think that using {{ cite news}} isn't a good idea, since in most cases you won't be able to fill in the details. You could add the string "Accessed YYYY-MM-DD" after the link, inside the ref tags, without much trouble. — Carl ( CBM · talk) 15:37, 30 December 2007 (UTC)
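Carl's suggestion is cheap to implement; a sketch (the function name and output format here are illustrative, not the bot's actual code):

```python
import datetime

def titled_ref(url, title):
    # Hypothetical: wrap a bare link into a titled link plus an
    # "Accessed YYYY-MM-DD" note inside the ref tags, as Carl suggests.
    today = datetime.date.today().isoformat()
    return '<ref>[%s %s]. Accessed %s.</ref>' % (url, title, today)

print(titled_ref('http://example.org', 'Example Domain'))
```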
An issue with JavaScript in the HTML. Will application/xml MIME types be accepted? What about servers that don't send out any types? — Dispenser ( talk) 10:03, 31 December 2007 (UTC)
application/xml MIME-type links are ignored; the same goes for servers not sending out types. If you have examples of links where I *should* not ignore them, I can try to improve this behaviour. NicDumZ ~ 14:12, 31 December 2007 (UTC)
Dispenser ( talk) 04:08, 3 January 2008 (UTC)
application/xhtml+xml, application/xml, and text/xml MIME types. NicDumZ ~ 08:11, 3 January 2008 (UTC)
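The filter described in this exchange could be sketched as follows (an assumption on my part: the bot inspects the HTTP Content-Type header before parsing; the accepted list is taken from the reply above):

```python
# Content types worth parsing for a <title>; anything else (PDFs, images,
# or servers sending no type at all) is ignored.
ACCEPTED = ('text/html', 'application/xhtml+xml', 'application/xml', 'text/xml')

def is_parseable(content_type):
    # Servers often append "; charset=..." to the header, or omit it entirely.
    if not content_type:
        return False
    return content_type.split(';')[0].strip().lower() in ACCEPTED

print(is_parseable('text/html; charset=utf-8'))  # True
print(is_parseable(None))                        # False
```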
I encountered encoding problems with exotic charsets that BeautifulSoup couldn't handle properly (the Arabic charset windows-1256). I now try to retrieve the charset from meta tags to give an accurate hint to BeautifulSoup: problem solved for that particular charset. (TODO: when able to fetch a valid charset from meta tags, convert the document encoding myself and retrieve the title with a simple regex. When no valid charset is found, keep the current behavior, i.e. parse <title> markups with BeautifulSoup, where the encoding is "guessed" by BS.)
NicDumZ ~ 13:29, 31 December 2007 (UTC)
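The TODO could be sketched roughly like this (function names are illustrative; the real script hands the charset hint to BeautifulSoup instead of decoding by itself):

```python
import re

def guess_charset(raw_html):
    # Look for a charset declaration in the raw bytes of the page.
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)',
                  raw_html, re.IGNORECASE)
    return m.group(1).decode('ascii') if m else None

def extract_title(raw_html):
    charset = guess_charset(raw_html)
    if charset is None:
        return None  # fall back to BeautifulSoup's own guessing here
    # Decode the document ourselves and grab <title> with a simple regex.
    text = raw_html.decode(charset, errors='replace')
    m = re.search(r'<title[^>]*>(.*?)</title>', text,
                  re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None

html = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=windows-1256">'
        '<title>Example</title></head></html>').encode('windows-1256')
print(extract_title(html))  # Example
```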
I also just implemented a soft switch-off: User:DumZiBoT/EditThisPageToStopMe. NicDumZ ~ 21:38, 2 January 2008 (UTC)
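The mechanism, as I read it, reduces to a check before each batch (a sketch with illustrative names; the real page is User:DumZiBoT/EditThisPageToStopMe):

```python
# Soft switch-off sketch: any edit to the designated stop page halts the bot.
EXPECTED = ''  # the stop page is expected to stay empty (an assumption)

def should_stop(current_wikitext):
    # Whitespace-only changes are ignored; anything else means "stop".
    return current_wikitext.strip() != EXPECTED

print(should_stop('Please stop: bad titles on [[Foo]]'))  # True
print(should_stop('   \n'))                               # False
```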
You could use page.site().messages (which looks for "You have new messages" in the HTML); this way you don't need to check the page manually. — Dispenser ( talk) 03:38, 3 January 2008 (UTC)
How will you handle dead links such as those at User:Peteforsyth/O-vanish, and the special cases of redirects to a 404 page or to the root page that are handled by my linkchecker tool? — Dispenser ( talk) 18:09, 30 December 2007 (UTC)
fr:Utilisateur:DumZiBoT/Test :
http://www.oregonlive.com/news/oregonian/index.ssf?/base/news/1144292109305320.xml&coll=7
No title found... skipping
http://www.oregonlive.com/newsflash/regional/index.ssf?/base/news-14/1142733867318430.xml&storylist=orlocal
No title found... skipping
http://www.oregonlive.com/weblogs/politics/index.ssf?/mtlogs/olive_politicsblog/archives/2006_08.html#170825
HTTP error (404) for http://www.oregonlive.com/weblogs/politics/index.ssf?/mtlogs/olive_politicsblog/archives/2006_08.html#170825 on fr:Utilisateur:DumZiBoT/Test
[...]
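The behaviour visible in that log boils down to three outcomes (a summary sketch, not the bot's code):

```python
def classify(status, title):
    # Mirrors the log above: hard 404s are reported, pages without a
    # usable <title> are skipped, everything else gets a title added.
    if status == 404:
        return 'HTTP error (404)'
    if title is None or not title.strip():
        return 'No title found... skipping'
    return 'add title'

print(classify(404, None))        # HTTP error (404)
print(classify(200, '  '))        # No title found... skipping
print(classify(200, 'A story'))   # add title
```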
# Titles or URLs suggesting a registration/login wall
regreq = re.compile(r'register|registration|login|logon|logged|subscribe|subscription|signup|signin|finalAuth|\Wauth\W', re.IGNORECASE)
# Titles suggesting a "soft 404" error page served with HTTP 200
soft404 = re.compile(r'\D404(\D|\Z)|error|errdoc|Not.{0,3}Found|sitedown|eventlog', re.IGNORECASE)
# URLs that point at a directory index rather than a specific document
directoryIndex = re.compile(r'/$|/(default|index)\.(asp|aspx|cgi|htm|html|phtml|mpx|mspx|php|shtml|var)$', re.IGNORECASE)
# Untested: common error-page phrases
ErrorMsgs = re.compile(r'invalid article|page not found|Not Found|reached this page in error', re.IGNORECASE)
(←) I'm worried about false positives with your regexes. What about pages like [18], [19], or [20]? Their titles match regreq or soft404, and yet they are valid. Or are you talking about checking, with these regexes, the links where you are redirected? (Even then, what if a SSHlogin.htm is moved permanently to another address, but with the same name? 'SSHlogin.htm' does match regreq!) I do not understand everything... NicDumZ ~ 14:12, 31 December 2007 (UTC)
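The false-positive worry is easy to reproduce with the regexes quoted above:

```python
import re

# Same patterns as quoted in this discussion.
regreq = re.compile(r'register|registration|login|logon|logged|subscribe|'
                    r'subscription|signup|signin|finalAuth|\Wauth\W',
                    re.IGNORECASE)
soft404 = re.compile(r'\D404(\D|\Z)|error|errdoc|Not.{0,3}Found|sitedown|eventlog',
                     re.IGNORECASE)

# A legitimate filename trips the registration pattern...
print(bool(regreq.search('SSHlogin.htm')))                 # True
# ...and a perfectly valid title trips the soft-404 pattern.
print(bool(soft404.search('Trial and error in science')))  # True
```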
Over the last few days, I ran several 1000-edit batches on fr, and eventually the whole db got reflinks.py'ed. Still waiting, but for now I got signaled only 2 errors:
I guess that DumZiBoT is now ready for the big jump... NicDumZ ~ 10:19, 5 January 2008 (UTC)
{{ BAGAssistanceNeeded}}
Some comments, questions, updates, maybe?
Thanks...
NicDumZ ~ 08:22, 6 January 2008 (UTC)
Trial went fine, I believe. We're discussing code improvements with Dispenser, but none of these improvements will significantly alter the bot's behavior. Do I need anything else before getting fully approved?
Thanks,
NicDumZ ~ 00:51, 10 January 2008 (UTC)
The source is now available at User:DumZiBoT/reflinks.py.
Please edit it if you think that it needs improvements. I mean it. NicDumZ ~ 18:05, 8 January 2008 (UTC)
page.get() caches the result, so the duplication was eliminated, and t = wikipedia.html2unicode(t) was used; but you definitely improved my code. I quickly tested it on fr:, and as expected, it seems to be working. Thanks a lot for your input. It was, as always, very, very useful... I'm somewhat running out of time these days, but:
NicDumZ ~ 00:08, 16 January 2008 (UTC)
The characters . , \ : ; ? ! [ ] < > " (space) are significant at the end of URLs. Observe URLs ending with .,\:;?!
The characters that break the URL from the title in the bracketed case are: Due to the way text is processed, the following quirk happens when leaving out the space and using formatting: most parsers not implementing HTML rendering will fail with these links.
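For the bare-URL case, trimming the significant trailing characters listed above can be sketched like this (an illustration only; the exact character set MediaWiki uses may differ, and brackets, angle brackets, quotes, and spaces cannot occur inside a bare URL at all, so only punctuation needs trimming here):

```python
# Punctuation from the list above that gets trimmed from the end of a
# bare external URL (assumption based on the discussion, not MediaWiki source).
TRAILING = '.,\\:;?!'

def trim_url(url):
    return url.rstrip(TRAILING)

print(trim_url('http://example.org/page.'))  # http://example.org/page
print(trim_url('http://example.org/a?!'))    # http://example.org/a
```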