Automatic or manually assisted: Supervised automatic
Programming language(s): PHP, Kingbot+AWB
Function summary: Automated assessment of the class of articles for Projects as Stub or Start. The main target is Country Projects that have a lot of transwikied, Unassessed "town" articles. Where a population size can be extracted from an infobox, the bot will in many cases also assign Importance based on population. There is more description on the Talk page, although the numbers have since been updated.
Edit period(s): Typically run once per Project that wants it.
Edit rate requested: 12 per minute
Already has a bot flag (Y/N): N/A
The bot scans the lists of articles covered by a project and opens each of them in turn, at intervals of more than 1.0 second, using Special:Export. It extracts various statistics, such as the number of <ref> tags and the number of images, and also which infoboxes an article uses. If the Template:Infobox CityIT box is present, the article goes on to the next stage.
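The statistics step might look roughly like the sketch below. The bot itself is written in PHP; this is only an illustration in Python, and the function name `article_stats` and the regular expressions are assumptions, not the bot's actual code. It works on wikitext that has already been fetched, so no request to Special:Export is made here.

```python
import re

# Assumed patterns: an opening <ref> or <ref name=...> tag, an [[Image:...]]
# or [[File:...]] link, and a template whose name starts with "Infobox".
REF_RE = re.compile(r"<ref[\s>]", re.IGNORECASE)
IMAGE_RE = re.compile(r"\[\[\s*(?:Image|File)\s*:", re.IGNORECASE)
INFOBOX_RE = re.compile(r"\{\{\s*(Infobox[^|}\n]*)", re.IGNORECASE)


def article_stats(wikitext):
    """Count refs and images and list the infobox templates in one article."""
    infoboxes = [name.strip() for name in INFOBOX_RE.findall(wikitext)]
    return {
        "refs": len(REF_RE.findall(wikitext)),
        "images": len(IMAGE_RE.findall(wikitext)),
        "infoboxes": infoboxes,
        # Only articles with a CityIT infobox go on to the next stage.
        "has_cityit": any(n.lower() == "infobox cityit" for n in infoboxes),
    }
```

Note that `<references/>` is deliberately not counted as a reference, since it only marks where the footnotes are listed.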
The bot extracts the population of a commune if possible, and proposes the following assessment of importance:
Class is based primarily on the length of the article. Many articles on Italian comuni have a 2 kB demographic timeline which doesn't affect assessment, so its length is calculated and deducted from the length of the article. Classes are as follows:
These categories may vary depending on the Project - for instance, the France Project assesses all towns with more than 100,000 people as High by definition.
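As a rough illustration of the two ideas above, the following Python sketch deducts a timeline from the article length and maps a population to an Importance through a per-Project rule list. It makes several assumptions: that the demographic timeline appears as a `<timeline>...</timeline>` block, that length is measured in characters, and that rules are simple "population greater than N" tests. The function names are hypothetical, and the only threshold shown is the France Project rule quoted above; no other thresholds are implied.

```python
import re

# Assumption: the demographic timeline is a <timeline>...</timeline> block.
TIMELINE_RE = re.compile(r"<timeline>.*?</timeline>", re.IGNORECASE | re.DOTALL)


def effective_length(wikitext):
    """Article length in characters with any timeline blocks deducted."""
    return len(TIMELINE_RE.sub("", wikitext))


def importance_from_population(population, rules):
    """Return the label of the first (minimum, label) rule exceeded, else None.

    `rules` must be sorted from the largest minimum to the smallest.
    """
    for minimum, label in rules:
        if population > minimum:
            return label
    return None


# The one rule stated in the text: France assesses towns >100,000 as High.
FRANCE_RULES = [(100_000, "High")]
```

Keeping the rules as data rather than code would make it easy to vary the categories from one Project to the next.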
Once generated, the list of assessments is examined by me and any obvious tweaks are applied. The list is then fed into Kingbot and AWB. The Italy Project has about 1500 articles to do; some of the other Projects may have up to 5000.
In support of the primary task, the bot may extract Comune infoboxes from the Italian versions of articles, clean them up and translate items such as months, and use subst:Comune to apply a CityIT infobox to the English article.
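The month translation mentioned above could be done with a simple lookup, as in this Python sketch. The function name `translate_months` is hypothetical and the approach (whole-word, case-insensitive replacement) is an assumption about how such a clean-up might work, not a description of the bot's PHP code.

```python
import re

# Italian month names and their English equivalents.
IT_TO_EN_MONTHS = {
    "gennaio": "January", "febbraio": "February", "marzo": "March",
    "aprile": "April", "maggio": "May", "giugno": "June",
    "luglio": "July", "agosto": "August", "settembre": "September",
    "ottobre": "October", "novembre": "November", "dicembre": "December",
}

# Match month names only as whole words, in any letter case.
MONTH_RE = re.compile(r"\b(" + "|".join(IT_TO_EN_MONTHS) + r")\b", re.IGNORECASE)


def translate_months(text):
    """Replace Italian month names in `text` with English ones."""
    return MONTH_RE.sub(lambda m: IT_TO_EN_MONTHS[m.group(1).lower()], text)
```

For example, `translate_months("31 dicembre 2007")` gives `"31 December 2007"`, while place names such as Popoli are left alone.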
The nice thing about this approach is that you can see how it handles articles that have already been assessed manually. On the Italy Project there are 2529 assessed articles with CityIT infoboxes. The bot would assess 1 current High as a Mid (Tivoli, Italy) and 14 current Mids as Lows (Tolfa would be #7 by size); 123 would not be assessed. Given that I would be eyeballing them manually to check for "obvious" misses, and that in any case one would expect that "more-important-than-their-small-population-would-suggest" villages are less likely to have remained unassessed, I think a worst-case false positive rate of 0.6% (15 out of 2529) is pretty acceptable.
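The 0.6% figure can be checked from the counts quoted above. The short Python calculation below uses only those numbers; the variable names are just for illustration.

```python
assessed = 2529      # manually assessed articles with CityIT infoboxes
high_to_mid = 1      # current Highs the bot would call Mid
mid_to_low = 14      # current Mids the bot would call Low
unassessed = 123     # articles the bot would leave unassessed

disagreements = high_to_mid + mid_to_low

# Rate over all 2529 articles, and over only those the bot would assess.
rate_all = disagreements / assessed
rate_assessed_only = disagreements / (assessed - unassessed)

print(f"{disagreements} disagreements")
print(f"{rate_all:.1%} of all articles")
print(f"{rate_assessed_only:.1%} of articles the bot would assess")
```

Either way of counting rounds to 0.6%.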
On the class assessments, 30 out of 240 current Starts would be assessed as Stubs, but 24 of those 30 carry Stub tags in the article, and they're all in the fuzzy grey area between Start and Stub - #15 is Mozzanica, to give you an idea. 2206 out of 2277 current Stubs would be assessed as such and 16 would be assessed as Starts, with the rest unassessed - again it's that grey area; Artegna is #8 and I'm not too worried about that. 194 Starts would be recognised as such.
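The class figures can be tabulated the same way. This Python snippet uses only the counts quoted above; the "unassessed" numbers are derived by subtraction, which assumes every article falls into exactly one of the three outcomes.

```python
starts_total, starts_as_stub, starts_as_start = 240, 30, 194
stubs_total, stubs_as_stub, stubs_as_start = 2277, 2206, 16

# Whatever is neither Stub nor Start is taken to be left unassessed.
starts_unassessed = starts_total - starts_as_stub - starts_as_start
stubs_unassessed = stubs_total - stubs_as_stub - stubs_as_start

print(f"Starts kept as Start:  {starts_as_start / starts_total:.1%}")
print(f"Starts called Stub:    {starts_as_stub / starts_total:.1%}")
print(f"Starts unassessed:     {starts_unassessed}")
print(f"Stubs kept as Stub:    {stubs_as_stub / stubs_total:.1%}")
print(f"Stubs called Start:    {stubs_as_start / stubs_total:.1%}")
print(f"Stubs unassessed:      {stubs_unassessed}")
```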
So I'm pretty comfortable with the accuracy, particularly since this is not in the main space and assessment is a bit of an art in any case. Even the false positives could easily be assessed the other way by a human assessor.
Only 4 out of 49 fail to be assigned a Class - Alanno, Casoli, Popoli and Silvi. The last one is an unusually well-referenced stub ;-/, the other three are genuinely debatable.
There were 8 Starts - Pescasseroli, San Giovanni Lipioni, Vasto, Campli, Montefino, Torricella Sicura, San Benedetto dei Marsi and Avezzano; the rest were Stubs. Avezzano is much the most interesting one: it comes up on one out of two Stub tests and one out of two Start tests (so Start wins). Again it's a slightly malformed article; I think Start is probably the right assessment, but there's not a lot in it.
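The Avezzano case suggests a simple voting rule between the Stub tests and the Start tests. The Python sketch below is a guess at that rule: the function name `decide_class` is hypothetical, the tests themselves are not described in the text, and returning no class when no test fires is an assumption. Only the "a tie goes to Start" behaviour comes from the description above.

```python
def decide_class(stub_hits, start_hits):
    """Pick a class from the number of Stub tests and Start tests that fired.

    Returns "Start", "Stub", or None when neither kind of test fired
    (assumed here to mean the article is left unassessed).
    """
    if stub_hits == 0 and start_hits == 0:
        return None
    # On a tie Start wins, as in the Avezzano example (1 of 2 each).
    return "Start" if start_hits >= stub_hits else "Stub"
```

With one hit on each side, as for Avezzano, `decide_class(1, 1)` returns `"Start"`.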
For your convenience, I sorted them so that the 'interesting' ones were done first, and after Spoltore there's nothing but Stubs in order of increasing stubbiness. Of course, since as a trial bot I was using AWB in manual mode, I couldn't resist the human temptation to muck about with things. Which means that at the start, I included the three articles assigned neither Class nor Importance, just so you could see what was going on. And then I further messed around with Talk:Popoli trying to make it extra clear what was going on, but instead just screwed things up - I'll sort it out as soon as this is over. Like I say, I was just trying to give you the "overview" of the articles that the bot is looking at - in reality Popoli, Silvi and Alanno would never have made it into the list fed into Kingbot. Let me know if you need more doing. FlagSteward (talk) 00:41, 9 March 2008 (UTC)