WIKIPEDIA:BOTS REQUESTS FOR APPROVAL COREVA-BOT 2

The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

Request Expired.

Coreva-Bot

Operator: Excirial (Contact me, Contribs) 18:42, 7 January 2009 (UTC)

Automatic or Manually Assisted: Fully automatic, with the possibility to manually override the bot's behavior if desired.

Programming Language(s): VB.net

Function Summary:

Query Wikipedia API every X minutes (Currently: 30 minutes) for new pages

If bot is cold started, fetch newpagelist with the last X (Idea: 500-1000) pages. (See: Note 1)
If the bot is running, only fetch the list of new pages since the last visit.

If the bot has found any new pages, load the page content and start to parse it.

Bot will parse the content to determine if any maintenance tags have to be placed.

If there is a need to place a maintenance tag, add the tag to the article, and resume with the next article.
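The cold-start versus incremental fetch described above could be sketched as follows. This is only an illustration in Python, not the bot's VB.net code: it assumes the standard MediaWiki `list=recentchanges` query with `rctype=new`, and the function name, the 500-page cold-start limit and the JSON output format are choices made for the example (the operator mentions parsing XML).

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
COLD_START_LIMIT = 500  # assumption: lower end of the "500-1000" idea above


def new_pages_query(last_seen: Optional[datetime]) -> dict:
    """Build API parameters for listing newly created articles.

    last_seen is None on a cold start; otherwise it must be a
    timezone-aware datetime of the previous visit.
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "new",        # page creations only
        "rcnamespace": "0",     # article namespace
        "rcprop": "title|timestamp|ids",
        "rclimit": str(COLD_START_LIMIT),
        "format": "json",
    }
    if last_seen is not None:
        # Running bot: enumerate oldest first, starting at the last visit.
        params["rcdir"] = "newer"
        params["rcstart"] = last_seen.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
    return params


def new_pages_url(last_seen: Optional[datetime]) -> str:
    """Full request URL for the query above."""
    return API_URL + "?" + urlencode(new_pages_query(last_seen))
```

On a cold start the query simply returns the most recent creations; once a last-visit time is stored, only pages created since then are listed.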

Edit period(s) (e.g. Continuous, daily, one time run): Continuous

Edit rate requested: 1 edit per new page tops. (Estimated 10 edits a minute tops, currently a test setting that is open to be lowered.)

Already has a bot flag (Y/N): (Not applicable, new bot)


Function Details:
Note: Coreva-Bot had a previous bot request located here. Prototypes of the previous idea behind Coreva showed that it would be virtually useless. This request is for a functionally completely different bot (but with an identical name).

Coreva's main task is placing maintenance tags on new pages that require them, similar to the way most new page patrollers work their beat. Coreva will regularly (every 5-10 min) check the new page list for new articles, fetch each new article's content, parse the content (See: Parser Table) and finally update the article, adding the required maintenance tags.

Just like the previous Coreva, this one should also be quite light on server resources. The bot queries the server's new page list every 5-10 minutes, and (so far) each article requires two server queries (getting the article's content, and a query to check if the article is an orphan). Category counts, link counts et cetera are handled internally by the bot. Additionally, the bot will require one database write to add the templates (in case this is required). The estimated edit rate for the bot will be 2 edits per minute on average. (See: Note 2)

Coreva is not a miracle, and will never replace a living new page patroller. Coreva cannot patrol for WP:CSD and does not understand hoaxes, advertising or vandalism. However, a lot of articles slip off the new page list without having any form of maintenance tags. About half the pages on the new page list show as not being patrolled, and even though this is a very rough guess, this equals more than 2,000 pages a day. (See: Note 3) Since adding maintenance tags is thoroughly boring work, I think Coreva could spare quite a few patrollers a bit of boredom :). (Unlike CSD tags, which require at least some form of using your brain, maintenance tags require nothing more than checking 20 indicators, most of them nothing more than: Present/Not present.)

Finally, just like the old Coreva, it's still pretty much a work in progress, which is only done in spare time. While the progress on this Coreva is much faster than on the previous one, I assume it will still take a few months before it is capable of being a fully automated bot. Even if it were technically capable of doing so, it will not be a fully automatic bot until I have tested it thoroughly (a few weeks, I guess) in assist mode, which means Coreva would only give me feedback on what tag it would place on every page it checks. This way any annoying mistakes in the parser should be ironed out, while at the same time it allows me to improve the parser code.

Parser Table

This table gives an overview of the templates Coreva will be placing on the articles, along with the current criteria configuration for doing so. Note that this is still pretty much in beta stage; templates may be added and removed depending on tests. Also, the criteria are still based on very simple algorithms. Coreva's tests are conducted on a very small and varied set of locally stored articles, thus the criteria are still general. In their current form they should, however, produce very few false positives (but would likely have quite a few false negatives). So all in all: work in progress! (See: Note 4)

Tag	Criteria	Comment
Wikify	No internal links	Amount of internal links = 0
Uncat	No categories in the article.	Amount of categories = 0
Unreferenced	No references in the article	No Ref tags, and no references/notes header detected.
Footnotes	Article contains a standard "Notes" or "References" header, but no Ref tags	-
Internal Links	Article contains fewer internal links than a set ratio (amount of words / amount of links)	Percentage not yet set.
Orphan	Article is linked by (0) articles.	-
Stub	Article size is smaller than X	Suggestion: <1kb / 100 words / 1000 characters (inc. spaces)
Sections	Article contains too few sections for readability's sake (Note: a section equals a line break)	< 6 sections counter && (Amountofsections * 2500) > articlesize.
Too many links	To be determined	For this I still need to analyze guidelines, and the appropriate category.
Too many categories	Amount of categories > X	X: 10? 20? 30? Depends quite a bit on the article size. Perhaps a base of, say, 10, and another cat for every x words. (For example, World War II has 42 cats, but it's a huge article.)
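As an illustration of the two simplest rows in the table (Wikify and Uncat), the counting could look roughly like the sketch below. It is not Coreva's parser: the regular expression, the function names and the decision to skip File/Image links are assumptions made for the example, and it only inspects raw wikitext.

```python
import re

# Matches the target part of [[Target]], [[Target|label]] or [[Target#section]].
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def count_links_and_categories(wikitext: str) -> tuple:
    """Return (internal_links, categories) found in raw wikitext."""
    links = 0
    categories = 0
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip().lower()
        if target.startswith("category:"):
            categories += 1
        elif target.startswith(("file:", "image:")):
            continue  # embedded media is not counted as an internal link
        else:
            links += 1
    return links, categories


def basic_tags(wikitext: str) -> list:
    """Apply the 'Amount of ... = 0' criteria from the table."""
    links, categories = count_links_and_categories(wikitext)
    tags = []
    if links == 0:
        tags.append("Wikify")
    if categories == 0:
        tags.append("Uncat")
    return tags
```

Because this looks only at the page text, categories that are added by templates would be missed, so a real implementation would need an additional check for those.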

Notes

Note 1: A second idea is to let the bot store its last query time permanently, and query all new (non-patrolled) pages since the bot went offline. These pages could then be processed at a lower priority, meaning that they would only be processed once the bot runs out of pages to process, with a limit on the number of pages processed each minute. (So, if the bot limited itself to 5-10 edits a minute tops, it would mean that 3-8 old low-priority pages could be processed a minute.) Being in the CET timezone, this would translate to a queue of 250 or so pages generated overnight that could be processed during the remainder of the day.

Note 2: I am currently in doubt whether the bot should notify the user with a template in case maintenance tags are placed, encouraging the article creator to recheck the page while it is still "warm". This would double the bot's database writes, and at the same time I cannot predict if users are averse to being templated, or if anyone would change an article (or ask for help). On the other hand: if a user created a page on the basis of web sites, warning them that no sources were added could prevent a hell of a lot of wasted time for other users who would otherwise have to verify all the article's content from web searches.

Note 3: This is based on statistics from May 2006. During that time Wikipedia got 3600 new articles a day. Nowadays the number is most certainly quite a bit higher, but due to the difference between peak and normal hours, it's quite hard to make a guess based on special:newpages. :)

Note 4: It's rather obvious, but since I didn't mention it: Coreva does not add templates to pages marked for CSD, and does not add templates that already exist.

Discussion

Previous discussion
ThaddeusB
FYI, I was considering making a similar bot, so am familiar with the concept. Here are some things to consider:

* Many articles are not made in one edit. The bot should avoid tagging anything until a reasonable amount of time has passed since the last edit (perhaps 1 hour) to ensure that it doesn't tag works in progress.
* If the bot isn't going to make any effort to determine CSD criteria, it probably shouldn't mark pages as patrolled, since a human look might still be needed.
* Human editors often mark pages as patrolled without putting any maintenance tags (i.e. they only verify it is a legit article subject), so it is probably worthwhile to check all new articles.
* Using {{articleissues}} is preferable to placing a bunch of different cleanup tags.

-- ThaddeusB (talk) 20:21, 7 January 2009 (UTC)

Thanks a lot for the suggestions! While I wrote down most of the bot's proposed structure, design, code layout and class allocation, there are masses of fine details I did not consider yet, so bringing them to my attention certainly helps a lot :)

"Many articles are not made in one edit. The bot should avoid tagging anything until a reasonable amount of time has passed"

I have spent some time determining which is better: tagging in (near) real time, or tagging with a delay. I am not certain yet (probably I won't decide until Coreva goes on autopilot). There are some advantages to tagging with a delay: bad pages are likely already filtered out, and people get a chance to finish their article. The disadvantage, however, is that this modus operandi requires a few more internal checks to see if (say) an hour has passed since the article was created. At times Coreva could end up with a fairly full queue, but no permission to reduce that queue due to restrictions. There are also advantages to tagging in real time. First off, it's easier and more straightforward to implement. Another advantage is that if the article's creator were notified about the tags with a template, it would be much more likely that they are still online. The disadvantage is that it can be annoying to people that a page they are still working on gets tagged. Then again, that's the way each and every human new page patroller works, and it would also give an indication of things that are still needed within the article.

"If the bot isn't going to make any effort to determine CSD criteria, it probably shouldn't mark pages as patrolled since a human look might still be needed."

Good point, and one that I didn't really think about yet. While most of the basic functionality is already there (get the page list, store it in a database, parse a working set and check which tags the pages need), the entire "return the data to the server" part of the bot is still non-existent. When I get to that part I will surely pay attention to this suggestion/wisdom :).

"Human editors often mark pages as patrolled without putting any maintenance tags (i.e. they only verify it is a legit article subject), so it is probably worthwhile to check all new articles."

The only time I considered skipping patrolled pages was when the bot got a cold boot, meaning it has been offline for longer than a moment. Since the bot will run on a home PC, it's likely the bot will be going down a few hours a day, while I am asleep. Empirical data shows that 500ish pages are created during that time. When originally writing the RFBA that seemed like a lot, so leaving out anything patrolled was a good way to lessen the load. Then again, even if I limited the bot to 5 edits a minute, at an average edit rate of 2 pages a minute (for working away the new pages) it would still take just 3 hours to remove the backlog, and only just over an hour if I set it to 10. So it's indeed worthwhile to check all pages.

"Using {{articleissues}} is preferable to placing a bunch of different cleanup tags."

I fully agree with that. Friendly works the same way, just like the majority of the rest of the tools. It's just a few extra lines of code (or maybe even fewer) to work this way.

Sorry for the length of this reply, as it ended up being nearly the size of a separate RFBA. Truth is that I tend to consult suggestions while I'm busy building the bot. Sometimes it might take a while to implement a certain feature I thought about earlier, and by then I might already have forgotten half the things I thought about. So, just like with coding, good documentation while you are busy can be quite helpful :) Excirial (Contact me, Contribs) 21:34, 7 January 2009 (UTC)
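The backlog estimate in the reply above can be checked with a little arithmetic. The sketch below assumes, as the reply does, that about 2 edits per minute are taken up by freshly created pages and that whatever is left under the edit cap goes to the overnight backlog; the function name is made up for the example.

```python
def backlog_minutes(backlog_pages: int, edit_cap_per_min: float,
                    new_page_edits_per_min: float = 2.0) -> float:
    """Minutes needed to clear a backlog with the spare edit capacity."""
    spare = edit_cap_per_min - new_page_edits_per_min
    if spare <= 0:
        raise ValueError("no spare capacity left for the backlog")
    return backlog_pages / spare


# 500 pages at a cap of 5 edits/min: 3 spare edits/min, roughly 167 minutes.
slow = backlog_minutes(500, 5)
# 500 pages at a cap of 10 edits/min: 8 spare edits/min, 62.5 minutes.
fast = backlog_minutes(500, 10)
```

That is a little under 3 hours at 5 edits a minute and just over an hour at 10, which matches the figures quoted.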
Mr.Z-man
Some of these tags could have issues being added by bots. Some comments:

* Unreferenced - I would expand this to not tag if it has any external links as well, as those could potentially be refs, especially by people unfamiliar with Wikipedia style.
* Uncat - Make sure to get the prop=categories along with page text to get categories added by templates.
* Internal Links - This could potentially be screwed up by templates and stuff. A 3 sentence article with an infobox and a couple of stub templates might only need 2 links, but it'll have a lot of text.
* Orphan - Should this really be added minutes after creation? I think it would be better to give people some time, especially for this, as it requires editing other articles.
* Stub - This (and most others) will have to make sure to exclude disambig pages.
* Sections - Again, could be screwed up by infoboxes, other templates, and refs.
* Too many categories - I'm not sure if this can really be determined reliably with just numbers. Wikipedia's category system can be really f'd up in some cases. Ideally, no article should need 40 categories, but with no category intersections, it's unfortunately necessary in some cases.

As for user talk messages, it has benefits and drawbacks. Messages customized based on the tags could encourage people to fix the issues rather than scaring them off, but they could be scared off anyway by a bot screaming at them about what's wrong with their article. It could also annoy established users (see next comment). You could put the messages on the article talk page instead, but people might not see them there.

You might want to check the edit count of the creator; established users could probably be given the benefit of the doubt and their articles skipped, or just skip users on pages like User:JVbot/patrol whitelist.

{{articleissues}} would probably be best if it's adding more than 2 tags. Make sure to add the date to tags, so other bots don't have to come through and do it. -- Mr.Z-man 21:06, 7 January 2009 (UTC)

So many suggestions to improve Coreva already. Thanks, both of you, I really appreciate your thoughts on this!

* Unreferenced - Very, VERY helpful suggestion. I have seen that kind of referencing more than once, and I was already none too happy with the crude detection for the references. I can simply modify one of the filters a little bit, and it would do the trick.
* Uncat - Good point. Coreva can't detect those in its current incarnation. While I could request the page with the templates expanded, that would only create potential problems with article length and so on.
* Internal Links - It would not make much of a difference, I think, but I need to research this a little more before I draw conclusions. Most infoboxes could use some form of linking themselves. Either way, the values I was thinking about are 0.5-2% ish. The Bill Gates infobox is 190 words, for example, meaning that the article would need between 0.95 and 3.8 links. Seeing that the infobox itself contains a few links, it should not be that hard to reach the threshold :).
* Stub - Indeed, just as they need to avoid redirect pages. As long as a disambig template is present, it should not be too hard to detect a disambig page.
* Sections - I'm translating "section" to everything that does not contain a line break (Enter) in between. Then again, there is no real need to keep an Enter between the text and the closing characters of an infobox, table, etc. Not sure how I am going to solve that problem just yet, but I'll think of something eventually (I hope :-))
* Too many categories - 99.5% of the articles don't have categories in the first place, so I never expected this late addition to the list to get much attention. The only reason I included it anyway was because I saw a few (read: 10 a year tops) articles where mostly business owners ended up adding their article to each and everything even remotely relevant, causing articles to end up with a cartload of categories. As for normal, everyday use, I intend to set the threshold for adding this template quite high, so the absolute majority of the articles will simply never be tagged with this; just the weird cases I saw every now and then. This especially since there is no definitive guideline on the number of categories that I know of.
* Orphan - This tag is in no way easy to solve, especially not for a new user. The problem with not adding this tag is that it is exceedingly difficult to find an orphaned page later on. Special:LonelyPages is no longer updated, and since the article has no incoming links, the only way to get to the page is typing the title into the search box. I am (excuse my ignorance if I am mistaken) not aware of any method to check the number of links to an article other than through querying the API, or through a search in one of the special pages. Especially on pages with somewhat less regular titles ("Accessible publishing" would not be something I would put between link brackets, for one) or titles that are a bit different from regular written language (not a good example, but if someone wrote "Heart disease", then an orphaned "Cardiovascular disease" would never come up), it could very well be that articles never get linked as time passes. Thus, the only solution I know is placing an orphan tag. Though I would very much like to hear suggestions or comments on this reasoning, as I'm not especially happy with this situation either.
* Talk Messages - I kept a tally of this once, and virtually every unreferenced, uncat and wikify tag ended up on a quite new user's page. Template addition based upon edit count could end up being a very good idea. A new user could receive a friendly template containing information about the placed tags (if I end up including talk page messages I intend to write step-by-step guides for each template Coreva adds some day, so that new users don't get slapped with a lengthy guideline). Experienced users could be skipped, or could receive a short one-line notifier that Coreva added templates. Either way, there should be, and will be, a way to opt out indefinitely if implemented.

Again, my thanks for these suggestions! Excirial (Contact me, Contribs) 22:27, 7 January 2009 (UTC)
X!
Question - Why load the API every 5-10 minutes? Why not just use the IRC RC feed? X clamation point 06:09, 8 January 2009 (UTC)

Several reasons. The first reason is quite simple: it is a hell of a lot less work to code. I already needed an XML parser for the other API queries, so it was incredibly easy to let it fetch the new page list from the API. Furthermore, it's as easy as querying the server and receiving the data. If I were to use the IRC feed, I would have to code an entirely new feature based upon IRC parsing (connect to IRC, log in to the nickserver, join the correct channel, parse the format used there). All in all it would end up being a lot of extra work, with the additional disadvantage that I have never coded anything involving IRC in the first place.

The second reason is that the IRC feed offers no advantages over the API in Coreva's case. Coreva gains nothing from getting the new page list in real time. If anything, a 5-10 minute wait allows CSD patrollers to add CSD templates to the article, which prevents Coreva from parsing useless pages. Excirial (Contact me, Contribs) 10:42, 8 January 2009 (UTC)

Reopening request

Over the past two or so months the amount of time I could spend on Wikipedia was drastically reduced due to other duties, causing a certain lapse in Coreva's development. Another issue halting development progress was caused by an old programmer's trap: building a patched-together prototype which should have been thrown away once I had a proof of concept that it actually worked, and instead keeping the prototype and resuming work on it, which eventually led to a horrible code mess and a completely incomprehensible program. In the past month I finally found the time and willpower to run a step-through debugger over the entire program to decipher and salvage the mess as much as possible, before rewriting Coreva from scratch, save for a few salvaged functions that actually worked.

The actual workings of the bot have changed very little from the table I added above - I dropped STUB, TOMANYCATS and TOMANYLINKS due to them being prone to false positives. I am currently testing a module that can detect peacock pages (based upon statistical analysis, weighted word lists and some basic calculations). So far it works fine when comparing featured articles versus peacock articles (1 false positive on 270 correct tags), but the calculation algorithm makes too many mistakes on small articles, so it's disabled for now.

Et Cetera

The bot language switched from C#.net to VB.net to force recoding, rather than copy-pasting while rewriting it.
Per suggestion, the bot dates every tag, and uses articleissues over single templates when more than 2 tags are placed.
The bot won't check redirects, pages marked as disambiguation and pages marked for CSD, nor any pages that are already removed.
The bot won't double-tag. It can detect already placed {{ArticleIssues}} tags; similarly it can detect single-level templates, along with every listed alias of those templates.
Since the bot won't be running all day it remembers the article it last tagged before shutting down. When started again it will proceed where it left off (can be manually reset after vacations, etc.)
By default the bot checks the API for new pages every 30 minutes. The bot will keep those stored in a database and will form a buffer of articles to tag, with older articles having priority. Tests showed it is extremely rare for the bot to tag articles younger than 5 minutes. I hope this covers the "Pages take time" argument.
Edit rates are currently locked at 1 edit every 6 seconds. I am open to any advice regarding this rate.
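The dating and {{articleissues}} rule from the list above could be sketched like this. It is an illustration only: the function name is invented, the `tag=Month Year` parameter form of {{Articleissues}} is assumed rather than taken from this request, and the lowercase `date=` parameter follows what is reported as accepted later in this discussion.

```python
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def render_tags(tags: list, today: date) -> str:
    """Render maintenance tags as wikitext, dated with month and year."""
    if not tags:
        return ""
    stamp = f"{MONTHS[today.month - 1]} {today.year}"
    if len(tags) > 2:
        # More than 2 tags: combine them in one banner (assumed syntax).
        params = "|".join(f"{tag.lower()}={stamp}" for tag in tags)
        return "{{Articleissues|" + params + "}}"
    # One or two tags: place them as separate dated templates.
    return "\n".join("{{" + tag + "|date=" + stamp + "}}" for tag in tags)
```

For example, `render_tags(["Unreferenced", "Orphan"], date(2009, 6, 18))` yields two separate templates, each carrying `date=June 2009`, while three or more tags collapse into a single banner.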

Excirial (Contact me, Contribs) 21:05, 11 June 2009 (UTC)

Bots that add tags to articles tend to be controversial. See Wikipedia:Bots/Requests for approval/Erik9bot 9, where a proposal to add {{unreferenced}} to all unreferenced articles had to be modified in order to overcome objection from a number of editors. Some people feel that visible cleanup templates on articles detract from the reader's experience, and should not be used. – Quadell (talk) 14:46, 14 June 2009 (UTC)

Well, if there is consensus I will add categories instead of visible templates, but I find myself surprised by the discussion at Erik9bot 9. The majority of the tags added to new articles these days are added with Friendly, and are all visible tags. Similarly, the 500k or so articles in WP:BACKLOG all use visible tags; why should a bot that does exactly the same work be subject to an issue called "Ugly Templates"? If this should be changed I would much rather see an RFC that changes this Wikipedia-wide, rather than through individual judgement that will only create a mass of different styles instead of uniformity. For now I won't pre-emptively change this as I see no consensus on this. Excirial (Contact me, Contribs) 07:23, 16 June 2009 (UTC)

"if there is consensus i will add categories instead of visible template" I oppose adding hidden categories without tags unless the bot regularly runs a review of edited articles to remove the categories when the parameters are no longer met. BirgitteSB 14:27, 17 June 2009 (UTC)

It will not. Coreva will only tag an article once it is created. Technically it could easily iterate through every Wikipedia article, but that would be inefficient to say the least. The intent of the bot is giving new articles some basic improvement advice and traceability. New editors can see what they should improve, and it prevents articles from fading into the great unknown because they are not linked to/from anything else. I assume that was the basis of the original criticism: the other bot would scan a mass of (long-standing) articles and add visible templates to them. Excirial (Contact me, Contribs) 14:47, 17 June 2009 (UTC)

I can't say whether that was the basis of the original criticism or not. I do not object to adding visible tags. But I disagree with this idea that because people object to that it will be OK to just add hidden categories. Without any sort of visible prompt, people are not going to remove these categories as the article matures. The categories will be filled with false positives by the time the backlogs are worked through to these months. BirgitteSB 18:37, 17 June 2009 (UTC)

Approved for trial (10 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. This is a very long RfBA, and the specs have changed throughout and are difficult to follow. I think the best way for all parties to understand what this bot would do is to give it a very small trial. – Quadell (talk) 13:12, 18 June 2009 (UTC)

It seems that Coreva needs a slight correction - on the first two edits I made a manual error, setting the bot to "Show only tags" mode, which caused it to blank a page. On the next two pages I noticed that I had made a slight error in the saving code. Coreva accidentally saved the article it was currently checking to the page it was processing, causing an overwrite with the wrong page contents.

My fault for assuming that a not thoroughly tested function would just work! It should not take too long to fix this though - I'll run Coreva in diagnostics to test it, and after that I will resume the test run. Excirial (Contact me, Contribs) 17:42, 18 June 2009 (UTC)

{{BotTrialComplete}} - I took the liberty of making a new set of 10 edits after I fixed the above issue - it proved to be a minor issue where I confused two functions, one used for working on the NEXT page, and one that was used for the current page (causing it to mix up two pages). The new edits are marked 0 to 9. The error on tag 9 - the incorrect addition of a sections template - has already been fixed. Excirial (Contact me, Contribs) 18:24, 18 June 2009 (UTC)

Another issue, the incorrect dating of maintenance tags (it didn't include "Date="), has also been solved now. Excirial (Contact me, Contribs) 07:54, 19 June 2009 (UTC)

Approved for trial (20 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Okay, let's have another go. – Quadell (talk) 22:38, 22 June 2009 (UTC)

Trial complete. Again a few slight bugs which have been fixed. In retrospect it might have been wiser to develop Coreva for a single template and add additional functionality later on - at least it would have prevented the need for repeated trials.

* Bug 1: Incorrect tagging with Unref templates. - Not really a bug. I optimized the regex and missed a character in the process, causing it to always fail.

Umm, of course it's not a bug if you foresaw the false positives, but not very considerate to leave it to others to fix them! Why don't you manually review flagged articles first? Sparafucil (talk) 00:33, 29 June 2009 (UTC)

Of course I did not foresee the errors; if I had, I would have fixed them before running Coreva, wouldn't you think :)? What I meant by "Not really a bug" is that I already had the correct code, but that I made a small copy-paste error in it, causing it to malfunction. As for cleaning this one up, I went after those myself and reverted them. Excirial (Contact me, Contribs) 06:53, 29 June 2009 (UTC)

Ah, we're perhaps talking about different things then, and I'm sorry for the sarcasm. What got my temper up was [ ] in the article space, labeled as a trial edit. Why should checking for references be done by a bot? The article clearly is based on the Encyclopedia article given at the end. Sparafucil (talk) 23:01, 29 June 2009 (UTC)

* Bug 2: Tagging of a disambig page. - Caused by the use of a specialized disambiguation template ({{Hndis}}). Coreva now checks for these, and every other template listed in that template's "See Also" section. Never even knew those existed :).

* Bug 3: Dating categories incorrectly. - Apparently an uppercase "Date" is not accepted as a parameter, so I use "date" now, which is accepted.

Excirial (Contact me, Contribs) 20:37, 28 June 2009 (UTC)

If you haven't put one in (I couldn't see a mention from skimming through this page), then you should add a limit to how soon after creation the bot is "allowed" to tag. Because it's possible (although highly unlikely since it doesn't run very often) that it would get in the way of deletion taggers etc. It should only mark articles which have been around for a few hours or so. - Kingpin¹³ (talk) 19:33, 2 July 2009 (UTC)

* Bug 4: adding requests for inline references to one or two sentence articles. This makes it hard to read the article, particularly after SmackBot moved the huge banner to the center of the article. Footnotes are useful for the reader, but the banner has more text than the article, and with so little text, footnoting is not so urgent. Please look at the articles and watch-list them while your bot is in its trial phases. If the banner overpowers the article it should not be there. Consider that very short stubs (about 32 words of text) with banners of the same number of words may not need the banner and a couple of bots modifying the article. -- 69.226.103.13 (talk) 07:11, 4 July 2009 (UTC)

{{OperatorAssistanceNeeded}} Any news on bug four or the status of this request? MBisanz talk 22:04, 18 July 2009 (UTC)

Quick status summary

Since this RFBA is quite old, it contains a lot of information which is no longer completely up to date. Besides, it has become so long that it is somewhat unreadable, thus here is a summary for quick reference.

General

What will Coreva-Bot's task be?: Coreva-Bot will function as a new page patrol, checking articles for problems. Once it has found an issue it will add the appropriate maintenance templates to the article.
How will Coreva operate?: When Coreva is started for the first time - that is, its database backend is empty - it will query the server for the list of the last 500 new pages and save that list to the backend. If Coreva already has data in its backend it will query the server for all pages created since it last ran (5000 limit, 500 for now as it is still not flagged as a bot). Coreva will then load and check pages, filling its save buffer. The speed at which pages are checked depends on the number of pages in the buffer - more pages means longer intervals. Every 6 seconds the buffer is checked for pages to save - if there are any, the oldest page will be saved with its templates added.
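The oldest-first save buffer described above might be modelled as a small priority queue. The sketch below is an assumption about the shape of such a buffer, not Coreva's implementation: the class and method names are invented, and the 6-second tick is only indicated in a comment rather than run.

```python
import heapq
from datetime import datetime
from typing import Optional, Tuple


class SaveBuffer:
    """Holds checked pages and hands them back oldest creation time first."""

    def __init__(self) -> None:
        self._heap = []  # entries are (created, title, new_text)

    def add(self, created: datetime, title: str, new_text: str) -> None:
        heapq.heappush(self._heap, (created, title, new_text))

    def pop_oldest(self) -> Optional[Tuple[str, str]]:
        """Return (title, new_text) of the oldest page, or None if empty."""
        if not self._heap:
            return None
        _created, title, new_text = heapq.heappop(self._heap)
        return title, new_text

    def __len__(self) -> int:
        return len(self._heap)


# A saving loop would call pop_oldest() once per tick (every 6 seconds in
# the description above) and write the page only when something is returned.
```

Keeping the buffer ordered by creation time means an older article is always saved before a newer one, regardless of the order in which the pages were checked.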

Tagging Article's
What will Coreva-Bot template for?: {{Uncategorised}}, {{Unreferenced}}, {{Footnotes}}, {{Wikify}}, {{Orphan}}, {{Sections}}, {{internallinks}}. Statistical analysis shows that the {{peacock}} template is prone to errors, which is why it is disabled indefinitely.
What restrictions apply for tagging?: Coreva will not template any pages marked as CSD - but it will template PROD and AFD pages. Coreva will not tag removed pages. Coreva will not tag pages marked as disambiguations (includes the basic disambig template, all aliases and specialized disambiguation templates such as {{hndis}}). It will not tag pages twice with the same template, in case maintenance templates already exist.
What are the criteria for each template to be added?: (Note: These criteria are constantly improved - do note that they only grow stricter, though.) Templates will not be added if one is already present.

Uncategorized: The article has 0 categories - Note that any category, including maintenance categories, count to this limit.
Unreferenced: The article contains no, or an empty reference header and no <ref> tags.
Footnotes: The article contains a reference header with any non whitespace content, and no <ref> tags. Also, the templates {{ 1911}} and {{ JewishEncyclopedia}} must not be present.
Wikify: The amount of internal links is 0.
Orphan: The article has no other article's linking to it.
Sections: Exponential mathematical formula
Internallinks: The article has fewer internal links than one for every 1000 characters. Note that, while being a rather unsophisticated filter, this works pretty well.
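The criteria listed above can be sketched in Python roughly as follows. This is an illustration only, not Coreva's real filters: the regular expressions and function names are assumptions, the Sections check is left out because its formula is not given here, and skipping Internallinks when Wikify already applies is likewise an assumption. Categories added through templates would not be visible to a check on raw wikitext like this one.

```python
import re

CATEGORY_RE = re.compile(r"\[\[\s*category\s*:", re.I)
# Internal links, excluding category, file and image links.
LINK_RE = re.compile(r"\[\[(?!\s*(?:category|file|image)\s*:)", re.I)
REF_TAG_RE = re.compile(r"<ref[\s>/]", re.I)
REF_HEADER_RE = re.compile(r"^==+[ \t]*references[ \t]*==+[ \t]*$", re.I | re.M)
ANY_HEADER_RE = re.compile(r"^==+.*==+[ \t]*$", re.M)
EXEMPT_RE = re.compile(r"\{\{\s*(?:1911|JewishEncyclopedia)\s*[|}]", re.I)


def reference_section(text):
    """Return the body of the References section, or None if there is no such header."""
    header = REF_HEADER_RE.search(text)
    if header is None:
        return None
    rest = text[header.end():]
    following = ANY_HEADER_RE.search(rest)
    return rest[:following.start()] if following else rest


def suggest_templates(text, incoming_links):
    """Return the maintenance templates the listed criteria would suggest."""
    tags = []
    has_ref_tags = REF_TAG_RE.search(text) is not None
    section = reference_section(text)
    section_has_content = section is not None and section.strip() != ""
    link_count = len(LINK_RE.findall(text))

    if not CATEGORY_RE.search(text):
        tags.append("Uncategorised")
    if not has_ref_tags and not section_has_content:
        tags.append("Unreferenced")
    if section_has_content and not has_ref_tags and not EXEMPT_RE.search(text):
        tags.append("Footnotes")
    if link_count == 0:
        tags.append("Wikify")
    elif link_count < len(text) / 1000:
        # Assumption: skipped when Wikify already covers the missing links.
        tags.append("Internallinks")
    if incoming_links == 0:
        tags.append("Orphan")
    return tags
```

The `incoming_links` count would come from the separate orphan query mentioned in the request; everything else is computed from the page text alone.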

Technical and operational limits

Articles younger than an hour are not checked - instead the bot goes to sleep mode until it is allowed to tag again.
Coreva tracks pages tagged - unless manually reset it will not tag the same page twice.
Edit rate will never exceed 10 edits per minute; mostly the bot will be at around 7 or 8 edits per minute, depending on the number of pages in its buffer.
The bot will query the server once on startup, and then again once every 30 minutes for new pages. Each page checked requires two queries: one for the article content, and one to check if the article is an orphan. If the article needs to be updated, the bot will save the page once per article that requires tags.
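The age limit and the edit-rate cap above can be illustrated with a short Python sketch. The one-hour minimum and the 6-second save interval come from this request; the function names are hypothetical, and deriving the 10-edits-per-minute cap from the 6-second interval is an interpretation that happens to be consistent with the stated numbers.

```python
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(hours=1)   # articles younger than this are not checked
SAVE_INTERVAL_SECONDS = 6      # one save per 6 s caps the rate at 10 edits per minute


def seconds_until_eligible(created, now):
    """Seconds the bot still has to wait before it may tag an article (0 if eligible)."""
    remaining = (created + MIN_AGE) - now
    return max(0.0, remaining.total_seconds())


def max_edits_per_minute(save_interval_seconds=SAVE_INTERVAL_SECONDS):
    """Upper bound on edits per minute when at most one page is saved per interval."""
    return 60 // save_interval_seconds


if __name__ == "__main__":
    created = datetime(2009, 10, 30, 12, 0, tzinfo=timezone.utc)
    now = created + timedelta(minutes=20)
    print(seconds_until_eligible(created, now))
    print(max_edits_per_minute())
```

Changing the waiting period then only means changing `MIN_AGE`.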

Todo
Coreva is quite near being "finished", at least the integral part of it. Due to the number of templates the bot handles, its filters will likely be constantly tweaked to reflect new templates or guidelines. In the future I might submit another feature request so that, in case Coreva runs out of new pages, it will check through older pages at a snail's pace. Other than this, the only thing that remains is some work on the GUI and on the efficiency of certain sections - none of which should change it controversially.

Re-Opened (Yet again Sigh)

Due to some unforeseen circumstances I have been almost completely inactive for the last 3 or so months, causing this bot request to expire yet again. Having finally found some spare time to work on this bot again, I would like to reopen this BRFA.

As for the current status: Bug number 4 is now solved; Coreva will only add the footnotes template to pages of substantial length. It also now converts ampersands and other reserved HTML characters correctly before saving the page, and I have updated the regexes used to determine if a template should be placed, thus reducing the number of false positives. Excirial (Contact me, Contribs) 22:11, 30 October 2009 (UTC)

The intention is to tag 6 minute old articles with maintenance templates? Why? Is there community consensus that an editor should have only 5 minutes to write before a bot tags the article? What might be missing, imo, is a few more minutes to write an article.

Personally, I'd let the bot finish it if my editing was interfered with in this manner. It takes hours to write an article. Sometimes I post a stub first. I'd like to see the community consensus for these tasks, for the templates to be added by a bot, and for the amount of time before adding the templates. It seems hostile if I understand the time frame correctly.

Also, how many templates will it add? It seems to say it will only add one, but which one of the many? Or will it add more? -- 69.226.106.109 (talk) 02:41, 31 October 2009 (UTC)

I'm not certain where you got the 6 minutes part, as Coreva is hard coded not to check any articles younger than an hour - if it runs into articles younger than an hour it will automatically disengage from tagging them until they have reached the required age. In that time the bot could very slowly iterate through Wikipedia's older articles to see if they have any issues - though for now it just halts itself until it is allowed to tag again.

You said you'd check for new pages every 5-10 minutes, so I guessed 6 minutes after the new page appeared it could have a tag on it. Is an hour a time that the community considers reasonable?

(See below)

The "Minimal time" part is of course easily changeable to a longer or shorter duration (it used to be 30 minutes, actually), but in this case I chose an hour so that any new contributor still has a chance to see the tags - and thus receives some input on how to improve the article. Keep in mind that new page patrols using FRIENDLY or similar software exhibit the same behavior as the bot - only faster. For example this article was tagged within 15 minutes and this one was tagged within 40 minutes. Note that these are just two random articles I dug up; I have seen plenty being tagged within 10 minutes. Similarly, quite a few are left completely untagged while they still need quite some work.

I don't think it works that way, and it's hard to follow the reasoning behind, well, human editors do this and it's worse so than what the bot will be doing...

Due to the way patrol tools work, articles tend to get tagged sooner rather than later, as articles are mostly processed at near real-time speed. During development I tended to mimic already present tools and procedures as much as possible, as those are obviously legal to use within the guidelines. Coreva has the added advantage that it can simply query the API to receive a list of recent changes, so from that perspective it matters little if the wait time is an hour, a day or a week. As far as I know there is no community consensus regarding how soon maintenance templates may be added - if there is, please tell me. Changing it takes very little time, since it only means adjusting a single number.

How about finding out about some reasonable length of time by getting some feedback from the community? An hour seems reasonable to me, unless someone is still working on the article right then. I created an article of average difficulty from one of the lists of missing articles to see how long I usually work on it before I would leave it for a while, Chaetopterus, and an hour seems okay, because I usually add more sources to my articles than most editors. But I would feel more comfortable about the timing if it were in line with voiced community guidelines. I do appreciate that you considered how users usually go about it. -- 69.225.3.198 (talk) 21:36, 2 November 2009 (UTC)

Certainly, it is always good if a bot has some form of community consent, and I will ask at the village pump tomorrow what users think a reasonable time would be. As said before, my timing was mostly based upon giving editors some time, while at the same time allowing new users to receive some feedback. However, seeing that you raised the issue that a bot tagging halfway can be annoying, I'm more than happy to change that - personally I always work in user space unless it's a small stub I can just create in minutes.

Yes, I think asking the community is good for what would be a reasonable time for a bot tagging new articles.-- 69.225.3.198 (talk) 23:29, 2 November 2009 (UTC)

Asked here. Feel free to comment if you are interested in it. :) Excirial (Contact me, Contribs)

As for the templates, Coreva will add one for every issue it detected. However, when multiple issues are found it will mimic WP:Friendly and add the grouped {{ articleissues}} template instead. Last, Coreva will only check an article once - after that it will not check it again unless I manually reset the bot. In this respect it is not that different from a new page patrol, who might tag your article with maintenance templates as well. Neither Coreva nor patrols are mindreaders, which means neither knows if you intend to continue work on an article later on. In both cases removing the templates in an edit you were already making is sufficient to keep the tags off. Excirial (Contact me, Contribs) 10:52, 31 October 2009 (UTC)
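To make the grouping behaviour concrete, here is a small Python sketch of how one tag per issue could be collapsed into a single grouped template. It is only an illustration: the function name is hypothetical, and the exact parameter layout of the grouped template (lower-case issue name with a date value) is an assumption rather than something stated in this discussion.

```python
GROUP_TEMPLATE = "articleissues"  # name as written in the comment above


def render_tags(issues, date):
    """Build the wikitext to prepend for a list of detected issues."""
    if not issues:
        return ""
    if len(issues) == 1:
        # A single issue keeps its own template.
        return "{{%s|date=%s}}" % (issues[0], date)
    # Several issues are merged into one grouped banner.
    params = "|".join("%s=%s" % (name.lower(), date) for name in issues)
    return "{{%s|%s}}" % (GROUP_TEMPLATE, params)
```

With this shape, an article with three detected issues still receives a single banner rather than three stacked ones.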

So, if a user writes a single line stub on an organism it could essentially be tagged with so many templates in an hour after it has been written that the reader cannot find the text in the article? IMO this is the equivalent of a speedy deletion, if you make it impossible to read the article by obscuring the text with tags? -- 69.225.3.198 (talk) 16:15, 2 November 2009 (UTC)

The only thing Coreva usually flags a well written stub article for is the lack of references; Diego de Miguel, for example, came through without any tag at all. Of course what you mention is possible; on the other hand, Domohani Kelejora High School was tagged with three templates, but this was because the article was plain text without any wiki formatting at all, which meant it really needed work done. Excirial (Contact me, Contribs) 18:16, 2 November 2009 (UTC)

So, what's a well-written stub? -- 69.225.3.198 (talk) 21:36, 2 November 2009 (UTC)

I wrote a quick tool this evening based upon Coreva, which allows me to evaluate any article within seconds, while giving feedback on what Coreva would have done if it had encountered it (and I tell you, it's a blessing, as it is more versatile than Coreva in its analysis, meaning that I can easily test and improve the detection algorithm).

Now, as for a well written stub: Your own Chaetopterus article would not have received any tag since its first revision. Also, pressing Special:Random while looking for stubs, these were a few results: Aigües - Unreferenced. Bērze parish - Unreferenced. Paddy Forde - None. McCulley Township, Emmons County, North Dakota - None. Cigaritis - Unreferenced. These are of course older articles, so I took 5 successive new articles as well: Belarusian Independence Party - None. Infimenstrous - Ignored for CSD. Aventure en Australie (TV episode) - Uncategorised, Unreferenced. The Reincarnation of Peter Proud (1973 novel) - Uncategorised, Unreferenced, Orphan. Jonas Cutting Edward Kent House - Orphan.

There was one false positive related to the sections template, which I traced back to a typo made while coding the analysis tool, rather than in Coreva. Excirial (Contact me, Contribs) 23:09, 2 November 2009 (UTC)

Use {{ Article issues}} not {{ Articlesissues}}. Rich Farmbrough, 19:21, 9 November 2009 (UTC).

From looking at these, I think I would like to have broader community consensus for the orphan tagging, and for the tagging in general. The time looks like it should be longer, say 3 hours during some periods, but this may be flexible. I don't know if the question you asked is sufficient for understanding the community's desire to tag in general. I am concerned, as I said, about adding tags to certain types of generally stubby articles. Many stubs about living things are just a single line and a taxobox, while Cigaritis would be a better article if referenced, and should be referenced, and its lack of references should be called to someone's attention, adding a no references banner across the top will overpower the text and essentially, imo, make the article useless to the reader. It might as well be deleted.

Can articles be categorized unreferenced without the huge banner, or can it be put on the bottom of the page? Where are these categories of unreferenced articles, by the way, I would like to add references to many of them. -- 69.225.3.198 (talk) 09:26, 4 November 2009 (UTC)

On the "unreferenced" issue, the bot's stated mechanism doesn't seem nearly sophisticated enough. "The article contains no, or an empty reference header and no ref tags" misses many potential referencing techniques. Generally, I doubt the bot is going to be an effective way to process for this tag; when, for instance, there are raw links in the article, it will be difficult for the bot to differentiate between ones that are useful references and ones that are not. Christopher Parham (talk) 15:06, 4 November 2009 (UTC)

(69.225.3.198) It is of course possible to add the category to the article without adding the "Visual" template, but I believe community consensus is against doing so because the requirements for improvement should be visible (if I remember a discussion some time ago correctly). The reasoning for this was that readers should be aware of the issues with the information they are presented. As for the category: it is located under WP:backlog, or more specifically under Category:Articles lacking sources. Currently just 188,583 are tagged, so by tomorrow you could be done with the backlog :P.

(Christopher Parham) Which is why I'm constantly busy improving Coreva's detection algorithms. The majority of articles either have no references or have references which are added correctly as stated in WP:MOS. There are indeed other techniques, such as linking websites in the middle of the text (either with an external link or just as text) or dumping them all at the bottom without a section header or ref tags, and I could go on for a while with these.

Most of these can, however, be reliably detected. A regular expression can easily filter websites out of the article, even if they are not marked as an external link. Seeing that these kinds of pages are fairly rare, I do not have the number of test subjects I would normally like, but I was considering marking pages with multiple external links in the text for cleanup. Alternatively, it is possible to ignore articles which seem to have links. This would certainly give false negatives, but it would still tag plenty of articles correctly. Currently a substantial part of the articles end up completely untagged in the first place, so it would already improve the situation, even if it does not solve it. Excirial (Contact me, Contribs) 16:34, 4 November 2009 (UTC)
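A regular expression of the kind mentioned above might look like the following Python sketch. The pattern and function names are assumptions made for illustration (Coreva's own expressions are not shown in this discussion); it simply counts anything that looks like a web address, whether or not it is wrapped in external-link brackets.

```python
import re

# Matches http://, https:// and bare www. addresses up to the next
# whitespace or wiki-markup delimiter.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s\]\[<>|}]+", re.I)


def count_urls(text):
    """Number of web addresses found anywhere in the wikitext."""
    return len(URL_RE.findall(text))


def seems_to_have_links(text):
    """True if the article contains at least one web address."""
    return count_urls(text) > 0
```

Under the "ignore articles which seem to have links" option, an article for which `seems_to_have_links` is true would simply be skipped for the unreferenced check, accepting some false negatives.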

You should have a look at Erik9Bot's BRFA to see some more ways articles can contain references that aren't immediately apparent. Also \([^)]* p+\. is a good string to look for. Rich Farmbrough, 19:21, 9 November 2009 (UTC).
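For readers unfamiliar with the pattern suggested above: it looks for an opening parenthesis followed, before any closing parenthesis, by a space and "p." or "pp.", which is how page numbers in parenthetical citations usually appear. A short Python sketch (the function name is hypothetical) shows it in use:

```python
import re

# The pattern quoted in the comment above, unchanged.
PAGE_CITATION_RE = re.compile(r"\([^)]* p+\.")


def has_parenthetical_page_citation(text):
    """True if the text contains something like '(Smith 2001, p. 45)'."""
    return PAGE_CITATION_RE.search(text) is not None
```

Note that the pattern is case sensitive and needs a space directly before the "p", so a citation written as "(Smith 2001,p.45)" would not be caught.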

That is indeed quite the handy BRFA. I'm glad to see that Coreva covers most of the points it mentions, but there are a few things that Coreva doesn't do, or does differently. It seems that the mentioned bot accepts any form of link starting with http:// as a reference, regardless of where the link leads. Perhaps a valid strategy, as it is quite difficult to have a false positive this way (though false negatives would likely increase). Searching for ISBN is something I certainly have to add, as is a "List of" / "Lists of" check, but the latter is something I was already planning to add.

If anything, I would rather not be forced to create a separate hidden category in which Coreva lists possibly unreferenced articles. If that were the case, I think I would prefer dropping the check for the unreferenced template, as it doesn't justify the extra work implementing it would create. I will be integrating the suggestions from that BRFA soon, but for now I became a little sidetracked with the idea that I could use Coreva to track dead references as well. For the last few days I have mostly spent my time tinkering on a prototype that I could integrate with Coreva. Seeing that Coreva will likely have quite some downtime due to the finite number of articles it has to check, it seems that a second activity could fit neatly into that time. Excirial (Contact me, Contribs) 22:51, 9 November 2009 (UTC)

* I would be dead against repeating what Erik9bot did. We have a hidden category with 100,000 + articles in it: I have seen people go through their "bailiwicks" just hoiking it out.

* In terms of the tag overpowering the article I have offered

and this could be made smaller, used for orphaned too. Uncat is not a problem, that is one backlog that is under control.

* there is a question in my mind about the usefulness of "orphan" anyway. I shall raise that at VP.

Rich Farmbrough, 21:05, 18 November 2009 (UTC).

I like it much better than the current one. Living thing stubs, though, aren't likely to be removed even with this tag, and, again, for one sentence and a taxobox it's still overpowering. Can it be put at the bottom of the article? I think it's better to have an article flagged in some way, by a banner like this for example, if it has no references, because encyclopedia articles, in general, should not be unreferenced. I'm just never sure who's fixing these unreferenced articles, or if the banners are just permanent parts of the articles. -- IP69.226.103.13 (talk) 11:30, 19 November 2009 (UTC)

Regarding the {{ Footnotes}} tag, what would the bot do with new articles that use parenthetical references, and have a references section but do not use the <ref> tag? For instance take John Vanbrugh and assume the notes section (not related to referencing in this case) didn't exist; how would the bot approach this article? Christopher Parham (talk) 15:12, 7 December 2009 (UTC)

A user has requested the attention of the operator. Once the operator has seen this message and replied, please deactivate this tag. (user notified) Seeing as Excirial has been inactive since November, I will probably be closing this as expired in a few days, until such time as he returns. MBisanz talk 06:39, 27 December 2009 (UTC)

Request Expired. MBisanz talk 01:28, 3 January 2010 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.