Operator: Primefac ( talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 14:10, Friday, March 24, 2017 ( UTC)
Automatic, Supervised, or Manual: automatic
Programming language(s): AWB
Source code available: AWB
Function overview: Replace magic words with templates
Links to relevant discussions (where appropriate): RFC, RFC follow up
Edit period(s): one time run initially, then maybe once a month until the magic link functionality is removed
Estimated number of pages affected: 378k (369k ISBN, 7000 PMID, 2150 RFC)
Exclusion compliant (Yes/No): yes
Already has a bot flag (Yes/No): yes
Function details:
ISBN\s([0-9 -]{9}[0-9 -]{4}?[0-9 -]{3}?[0-9X])(?!-|[0-9])
→{{ISBN|$1}}
RFC\s([0-9]{1,4})(?![0-9])
→{{IETF RFC|$1}}
PMID\s([0-9]+)
→{{PMID|$1}}
I've tried to account for every situation, and the only potential issue I can see is that strange ISBN values (specifically, mis-typed 11- or 14-digit ISBNs) get captured, but I genuinely can't figure out how to account for every single possibility without having an unnecessarily long regex (basically accounting for every possible combination of hyphen-and-number). Primefac ( talk) 14:10, 24 March 2017 (UTC) reply
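For readers who want to try the three patterns outside AWB, here is a minimal Python sketch. The patterns are copied from the function details above; the function name `convert_magic_links` and the use of Python's `re` module (with `\1` in place of AWB's `$1`) are assumptions for illustration, not the bot's actual setup. One thing the sketch makes visible: in Python's `re`, `{4}?` and `{3}?` are lazy exact repetitions rather than optional groups, so the ISBN pattern as written only matches a 17-character body such as a fully hyphenated ISBN-13.

```python
import re

# Patterns copied from the function details; replacements use \1 instead of $1.
# The rule order (ISBN, RFC, PMID) is an arbitrary choice for this sketch.
RULES = [
    (re.compile(r"ISBN\s([0-9 -]{9}[0-9 -]{4}?[0-9 -]{3}?[0-9X])(?!-|[0-9])"),
     r"{{ISBN|\1}}"),
    (re.compile(r"RFC\s([0-9]{1,4})(?![0-9])"), r"{{IETF RFC|\1}}"),
    (re.compile(r"PMID\s([0-9]+)"), r"{{PMID|\1}}"),
]


def convert_magic_links(wikitext: str) -> str:
    """Apply each magic-word rule in turn and return the rewritten text."""
    for pattern, replacement in RULES:
        wikitext = pattern.sub(replacement, wikitext)
    return wikitext


print(convert_magic_links("See ISBN 978-3-16-148410-0 and PMID 12345 and RFC 2616."))
```

A five-digit number after RFC is left alone by the lookahead, and an unhyphenated ten-digit ISBN is not matched by this version of the pattern.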
ISBN\s([0-9 -]+[0-9X])
PMID\s([0-9]+)
RFC\s([0-9]+)
(modified from the above to take any number following the RFC) Approved for trial (25 edits of each). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Headbomb { talk / contribs / physics / books} 18:07, 24 March 2017 (UTC) reply
|id= might have more than one identifier, but when there's only one, it's much better to make use of the existing CS1 parameter. See Help:Citation Style 1#Identifiers for which exist.
\|(\s*)id(\s*)=(\s*)\{\{(arxiv|asin|bibcode|biorxiv|citeseerx|doi|eissn|hdl|isbn|ismn|issn|jfm|jstor|lccn|mr|oclc|ol|osti|pmc|pmid|rfc|ssrn|zbl)\s*\|([^(\}\|)]*)\}\}\s*(\s*(\||\}\}))
replace |$1$4$2=$3$5$6
It works pretty well last I checked.
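A rough Python rendering of the |id= rule, for anyone who wants to see what it does to a citation. The pattern and replacement are the ones quoted above; the name `promote_single_id` is made up for this sketch, and matching is case-sensitive here, which may differ from how the rule is configured in AWB.

```python
import re

# Pattern quoted above; group 4 is the identifier template name and
# group 5 its value. Case-sensitive in this sketch (an assumption).
ID_TEMPLATE = re.compile(
    r"\|(\s*)id(\s*)=(\s*)\{\{(arxiv|asin|bibcode|biorxiv|citeseerx|doi|eissn"
    r"|hdl|isbn|ismn|issn|jfm|jstor|lccn|mr|oclc|ol|osti|pmc|pmid|rfc|ssrn|zbl)"
    r"\s*\|([^(\}\|)]*)\}\}\s*(\s*(\||\}\}))"
)


def promote_single_id(citation: str) -> str:
    """Turn |id={{doi|X}} into |doi=X when the template is the only identifier."""
    return ID_TEMPLATE.sub(r"|\1\4\2=\3\5\6", citation)


print(promote_single_id("{{cite journal |title=X |id={{doi|10.1000/xyz}} |year=2000}}"))
```

Because the pattern requires a `|` or `}}` right after the identifier template, an |id= holding two templates is left untouched. Whitespace between the identifier and the next parameter is dropped by the replacement.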
Headbomb { talk / contribs / physics / books} 22:49, 24 March 2017 (UTC) reply
I note the regular expression for MediaWiki's magic ISBN linking is, I believe,
\bISBN(?:\t| |&\#0*160;|&\#[Xx]0*[Aa]0;|\p{Zs})++((?:97[89](?:-|(?:\t| |&\#0*160;|&\#[Xx]0*[Aa]0;|\p{Zs}))?)?(?:[0-9](?:-|(?:\t| |&\#0*160;|&\#[Xx]0*[Aa]0;|\p{Zs}))?){9}[0-9Xx])\b
Most of the complexity is different kinds of whitespace that are allowed. It also skips anything that's inside link text, inside an HTML tag (e.g. in an attribute), and inside a bare URL. If you use something else, you're liable to miss and/or include things that are/aren't currently linked. The RFC and PMID links are almost right, they're
\b(?:RFC|PMID)(?:\t| |&\#0*160;|&\#[Xx]0*[Aa]0;|\p{Zs})++([0-9]+)\b
The actual code for all of it is here, although it runs after much of the wikitext has been parsed. Anomie ⚔ 23:03, 24 March 2017 (UTC) reply
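The MediaWiki patterns quoted above are PCRE, and Python's standard `re` module has no `\p{Zs}` class, so the following is only an approximation for experimenting with what currently gets linked. Assumptions in this sketch: the Unicode space separators are listed by hand, the possessive `++` is replaced by a plain `+`, the RFC/PMID keyword is captured for convenience, and the name `find_magic_links` is hypothetical. It does not reproduce the skipping of link text, HTML tags, or bare URLs.

```python
import re

# Hand-listed stand-in for \p{Zs} (Unicode space separators); an approximation.
ZS = "\u0020\u00a0\u1680\u2000-\u200a\u202f\u205f\u3000"
# One "space" as the quoted pattern allows it: tab, a space separator,
# or a numeric entity for U+00A0.
SPACE = r"(?:[\t" + ZS + r"]|&#0*160;|&#[Xx]0*[Aa]0;)"

MAGIC_ISBN = re.compile(
    r"\bISBN" + SPACE + r"+((?:97[89](?:-|" + SPACE + r")?)?"
    r"(?:[0-9](?:-|" + SPACE + r")?){9}[0-9Xx])\b"
)
MAGIC_RFC_PMID = re.compile(r"\b(RFC|PMID)" + SPACE + r"+([0-9]+)\b")


def find_magic_links(text: str) -> list:
    """Return (keyword, value) pairs for text that looks like a magic link."""
    hits = [("ISBN", m.group(1)) for m in MAGIC_ISBN.finditer(text)]
    hits += [(m.group(1), m.group(2)) for m in MAGIC_RFC_PMID.finditer(text)]
    return hits


print(find_magic_links("ISBN 0-306-40615-2, RFC 2616, PMID\u00a012345"))
```

Note how the trailing `\b` rejects an 11-digit run after ISBN, which is the mis-typed case raised in the original request.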
Approved for extended trial (100 edits each). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Please update summary to indicate that it may also do things beyond magic link deprecation. Headbomb { talk / contribs / physics / books} 00:58, 25 March 2017 (UTC) reply
@ Primefac: Fully done now. Also, is there a reason why AWB genfixes shouldn't be enabled during the main run? Headbomb { talk / contribs / physics / books} 02:41, 25 March 2017 (UTC) reply
doi:([\S]*)/([A-Za-z0-9.]*) → {{doi|$1/$2}}, and for PMCID I've got PMCID:PMC([0-9]+) → {{PMC|$1}}. Primefac ( talk) 02:58, 25 March 2017 (UTC) reply
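A quick Python sketch of these two rules, assuming nothing beyond the patterns quoted above; the function name `convert_doi_pmcid` is made up for illustration.

```python
import re

# Patterns quoted above; replacements use \1 and \2 instead of $1 and $2.
DOI = re.compile(r"doi:([\S]*)/([A-Za-z0-9.]*)")
PMCID = re.compile(r"PMCID:PMC([0-9]+)")


def convert_doi_pmcid(text: str) -> str:
    """Wrap bare doi: and PMCID: identifiers in their templates."""
    text = DOI.sub(r"{{doi|\1/\2}}", text)
    return PMCID.sub(r"{{PMC|\1}}", text)


print(convert_doi_pmcid("doi:10.1000/xyz123 and PMCID:PMC12345"))
```

Since the suffix class includes `.`, a sentence-final period directly after the DOI is absorbed into the template argument, so output from this pattern would need checking where DOIs end a sentence.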
( edit conflict) I came up with
(?<!(\[\[|=))doi:?\s*10\.([^(\s)]+?)(\.)?(\])?<(\/)?ref>
replace {{doi|10.$2}}$3$4<$5ref>
followed by
(?<!(\[\[|=))doi:?\s*10\.([^(\s)]+?)(\.|,|\])?(\s)
replace {{doi|10.$2}}$3$4
There are some GIGO cases, but I've yet to see a place where this screws up anything. When errors occur, GIGO tends to result in a missed conversion rather than a wrong conversion. Headbomb { talk / contribs / physics / books} 12:53, 25 March 2017 (UTC) reply
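For anyone testing the two-pass DOI rule in Python, a sketch under one stated adaptation: Python's `re` rejects the variable-width lookbehind `(?<!(\[\[|=))`, so it is split here into two fixed-width lookbehinds, which shifts the group numbers down by one. The name `convert_bare_dois` is hypothetical.

```python
import re

# Same intent as (?<!(\[\[|=)) but written as two fixed-width lookbehinds,
# because Python's re requires fixed-width lookbehind patterns.
GUARD = r"(?<!\[\[)(?<!=)"

# Pass 1: DOI immediately followed by <ref> or </ref>.
BEFORE_REF = re.compile(GUARD + r"doi:?\s*10\.([^(\s)]+?)(\.)?(\])?<(/)?ref>")
# Pass 2: DOI followed by optional punctuation and whitespace.
BEFORE_SPACE = re.compile(GUARD + r"doi:?\s*10\.([^(\s)]+?)(\.|,|\])?(\s)")


def convert_bare_dois(text: str) -> str:
    """Run the ref-adjacent pass first, then the whitespace-terminated pass."""
    text = BEFORE_REF.sub(r"{{doi|10.\1}}\2\3<\4ref>", text)
    return BEFORE_SPACE.sub(r"{{doi|10.\1}}\2\3", text)


print(convert_bare_dois("Smith 2001. doi:10.1000/xyz.</ref>"))
print(convert_bare_dois("doi:10.1000/abc, and more"))
```

Trailing punctuation stays outside the template in both passes. A DOI at the very end of the text, with no whitespace or ref tag after it, is matched by neither pattern and is simply left as is.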
I think this large-scale task is a good opportunity for the bot to also apply AWB's general fixes, so that all secondary tasks are done in addition to the main task. Many other bots have been running general fixes while performing a primary task. It's a great time. -- Magioladitis ( talk) 14:11, 25 March 2017 (UTC) reply
Based on my involvement with ISBNs I proposed some additions to the current regex, for instance to also catch tabs and dashes. There is also a list of bad ISBNs at Wikipedia:WikiProject Check Wikipedia/ISBN errors. -- Magioladitis ( talk) 14:27, 25 March 2017 (UTC) reply
Did this task at another wiki. Some notes:
Looking through the RFC category, it looks like the false positive rate for pages in this category is well over 1%. "RFC" is used within Wikipedia, meaning that phrases like "...since the RFC 4 years ago" should not be wrapped. RFC can also mean Rangers F.C. and Randers FC and whatever is meant in 1976 Eastern Suburbs Roosters season. Since the total population of the category is only about 2,000 pages, I recommend that AWB, rather than a bot, be used on this category, and that "RFC \d+" text that does not refer to an IETF RFC be wrapped in nowiki tags until the magic words are turned off. – Jonesey95 ( talk) 16:08, 26 March 2017 (UTC) reply
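The manual-review approach described above could be supported by a small helper that lists each "RFC \d+" hit with some surrounding context and wraps the ones a human rejects in nowiki tags. This is only a sketch of that idea; the function names and the context width are made up for illustration and are not part of AWB.

```python
import re

RFC_TEXT = re.compile(r"\bRFC (\d+)\b")


def rfc_candidates(wikitext: str, context: int = 20) -> list:
    """List (start, end, snippet) for every 'RFC <number>' hit, for human review."""
    return [
        (m.start(), m.end(), wikitext[max(0, m.start() - context):m.end() + context])
        for m in RFC_TEXT.finditer(wikitext)
    ]


def wrap_nowiki(wikitext: str, start: int, end: int) -> str:
    """Wrap one reviewed false positive in nowiki tags so it is no longer linked."""
    return wikitext[:start] + "<nowiki>" + wikitext[start:end] + "</nowiki>" + wikitext[end:]


text = "since the RFC 4 years ago"
start, end, snippet = rfc_candidates(text)[0]
print(wrap_nowiki(text, start, end))
```

When several hits on one page are wrapped, working from the last match backwards keeps the earlier offsets valid.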
I've been working on killing these magic links for a very long time. About a year ago I looked at doing the ISBN updates. Those are easy enough to do automatically or semi-automatically. PMID as well, probably. But RFC has only a few hundred instances and many are quirky/tricky. I would recommend doing RFC magic links manually. -- MZMcBride ( talk) 23:42, 3 April 2017 (UTC) reply
There were no objections in the call for comment, and those who commented agreed it was a good idea. The bot should likely restrict itself to something like (pseudo regex)
convert deprecated magic links to template usage - BRFA Primefac ( talk) 14:31, 29 May 2017 (UTC) reply
Maybe all ISBN-fixing bots should use the same edit summary? -- Magioladitis ( talk) 08:24, 7 June 2017 (UTC) reply
@ Primefac: I'm going to try to get this moving this week. Where do you see the current status of this task as being? There appear to have been substantial fixes, so I think this should probably have one more trial to ensure everything looks good. First, whatever happened with the ref tag issue? If that was never fixed, please experiment with a negative lookbehind for the start of a ref tag. That should fix the issue pretty easily. ~ Rob13 Talk 09:38, 13 June 2017 (UTC) reply
Approved for extended trial (400 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Please perform 200 edits each. ~ Rob13 Talk 16:30, 13 June 2017 (UTC) reply
Approved. ~ Rob13 Talk 23:25, 18 June 2017 (UTC) reply