Material was moved here from Wikipedia_talk:Version_1.0_Editorial_Team/Assessment#Overhaul_and_rewrite_of_the_assessment_scheme and related discussions on Wikipedia talk:Version 1.0 Editorial Team/Assessment. Walkerma ( talk) 17:35, 12 May 2008 (UTC)
For the assessment table, I'm wondering if we couldn't say more or less the same thing using fewer words; here's a suggestion. I've included a description of a "list", which was commented out of the table; feel free to delete it if it's not relevant.
| Label | Criteria | Reader's experience | Editor's experience | Example |
|---|---|---|---|---|
| B {{B-Class}} | Anything that's definitely better than the "Start" category, but doesn't meet higher standards. | It gives the impression that a typical reader would learn something. | Improve the article by trying to meet higher standards. | Jammu_and_Kashmir (as of October 2007) has a lot of helpful material but needs more. |
| Start {{Start-Class}} | A good article that is still weak in many areas. Has at least a particularly useful picture or graphic, or multiple links that help explain or give examples of the topic, or a subheading that covers one topic more deeply, or multiple subheadings that suggest material that could be added to complete the article. | Useful to some, provides more than a little information, but many readers will need more. | Major editing is needed, not a complete article. | Real analysis (as of November 2006) |
| Stub {{Stub-Class}} | Either a very short article or a rough collection of information that needs a lot of work. | Possibly useful. It might be just a dictionary definition. | Any editing or additional material can be helpful. | Coffee table book (as of July 2005) |
| List {{List-Class}} | An article that meets the definition of a Stand-alone List. It should contain many wikilinks, with descriptions. | There is no one way to make a list, but it should be logical and useful to the reader. | Lists can be anything from a stub to a Featured List. | List of aikidoka (as of June 2007) |
- Dan Dank55 ( talk) 19:45, 4 April 2008 (UTC)
There have been two proposals recently relating to assessment, and both seem to be reasonable (IMHO). They would both involve some rewriting and recalibrating, and therefore I think we should consider both proposals at the same time. I'm adding a third proposal, which is in effect how I think the first two would best be implemented together. There's also a fourth, which came up in discussions, and which I'll throw in for good measure. Walkerma ( talk) 18:01, 12 April 2008 (UTC)
(Described right above here) We should simplify the basic definitions of each class. The descriptions are quite detailed, but that may simply mean that people don't bother to read them properly. We could simply do a copyedit and chop out a lot of wording; that would make them easier to follow, but we may lose some of the rigour if actual examples or nuances of meaning are lost. Hence my proposal for a "summary style" approach; this will allow us to have very clear, simple definitions for routine use. Walkerma ( talk) 18:01, 12 April 2008 (UTC)
Be happy to help - Dan Dank55 ( talk) 02:00, 13 April 2008 (UTC)
See Wikipedia_talk:Version_1.0_Editorial_Team/Work_via_Wikiprojects#Assessment for the original proposal. We have a good scheme that works well, but there are variations in standards. It should be possible to sharpen the boundaries of the scheme by including additional examples to indicate specific detail about the levels (e.g., the lowest versus the highest standard within Start-Class). We may also be able to consider how we handle the different aspects of assessment (article length, quality, technical aspects, aesthetics, etc.). We have one very knowledgeable contributor offering to help, and I think we should use this opportunity to make the scheme more rigorous. Any thoughts? Walkerma ( talk) 18:01, 12 April 2008 (UTC)
Happy to help with style and language issues - Dan Dank55 ( talk) 02:09, 13 April 2008 (UTC)
This was my suggestion for dealing with the first two proposals, which at first glance would appear to be irreconcilable. How can we make the scheme even more nuanced and rigorous, yet make it simpler to understand? I think we can accomplish this through use of the summary style approach: have one short, succinct description of the scheme, but then have a sub-page (or sub-pages) to give more detail. That way, someone who just wants to "get the general idea" can do so, but the reviewer who is agonising over whether something is B or Start can look for more detailed guidance. Is this a good approach to the problem?
The scheme is now well into its third year, and some of the standard questions and proposals keep coming up over and over again:
I think it's about time we wrote a simple FAQ to deal with these questions; for every one person who posts on one of these, there are probably ten who are simply baffled and leave.
We have Wikipedia:WikiProject Council/Assessment FAQ, which we can always expand for this purpose. Titoxd( ?!? - cool stuff) 20:42, 11 May 2008 (UTC)
Happy to help with style and language issues - Dan Dank55 ( talk) 02:14, 13 April 2008 (UTC)
Our scheme has grown from around 2,000 articles when it was automated two years ago to around 1.1 million today - a greater factor of growth than Boston's from 1776 to the present. The scheme is holding up remarkably well, IMHO, but I think we need to revamp the "architecture" a bit. Walkerma ( talk) 18:01, 12 April 2008 (UTC)
Hi all. On Dank55's suggested simplification: I would certainly keep the current version with its more detailed descriptions. However, I doubt that those who have become accustomed to it will often refer to it in detail, so an abbreviated version could be used to complement (not replace) the more detailed version -- i.e. a kind of quick-reference version that people can go to if they prefer. An option to consider, anyhow.
From experience, the examples tend to be the most powerful part of the process, and as I've said elsewhere, I think it is excellent that you have examples. The description of an article (like the description of most complex things) can be interpreted in different ways -- most importantly here, more or less strictly or leniently -- and with different assumed interpretations of the various elements. Don't get me wrong, I think it's very important to describe, to orient people to what features they need to look at, but at the end of the day someone can always ask: so what does that actually look like? Just as a picture is worth a thousand words, so is an example of an article!
So a quick reference with the same examples could be useful for those assessing a lot of material, or even those who assess just a few things after becoming familiar with the scheme.
The more detailed version is also likely to be important in cases where there is some dispute. Holon ( talk) 09:26, 15 April 2008 (UTC)
I'd strongly advise against using different examples/exemplars in different versions of the generic scheme (not that anyone has suggested it), because I have seen empirical cases in which exemplars were changed in a scheme that otherwise remained the same, and there was a severe impact on ratings (e.g. it became much harder for an article to be placed in a given category). The relevant research was thorough and very controlled. I can't say exactly to what extent it applies to this scheme, but in general it's better to keep things as consistent as possible (provided, of course, they're sound and working!).
On that note, I'd also be careful changing the examples over time if you want consistent grading over time. Having said that, there are ways to link new to old if this is a must and I can advise and help. It takes some time and effort to make sure changing examples doesn't change the relative difficulties of the 'grades' though.
However, it may be very useful to have specific examples for more unusual kinds of article. In these cases, for the sake of comparability, I would advise people in the relevant projects to very carefully select examples that are as close as possible to the quality of those in the generic scheme (for the same reason: they're very powerful). The basic principle is this: in cases where there are unique considerations, and/or some of the generic considerations are not applicable (or less so), people have to take those considerations into account when deciding the grade of a given article. Now, assuming some effort goes into this, you may as well do it once and save having to do it every time thereafter. However, because the exemplars may have a strong impact on the assessments, ideally the first time a decision is made, an exemplar should be selected that is considered as close to the generic one as possible. Another thing worth considering.
Cheers Holon ( talk) 09:26, 15 April 2008 (UTC)
In keeping with the general principle of having a simple scheme with flexibility for cases that require or warrant special attention, I want to add that additional borderline examples could also be listed on a separate page, to be used only when necessary (e.g. if Start vs. B is a difficult call).
The scheme gives broad classifications, which is fine for many purposes. If people are interested in adding exemplars, I would recommend selecting candidate articles and having experienced assessors quickly do pairwise comparisons between the candidates and the existing exemplars. Even with relatively little data, I can analyse the results and report back the scaled locations in order, so that a decision can be made about which candidates fall below or above the borderline. If there is enough data, I can also advise which were most consistently judged, and which are therefore better to use.
Just as extra increments on a tape measure (or any instrument) provide additional precision in a region of a continuum, additional exemplars provide additional precision in a region of interest (the border between adjacent classifications). So additional exemplars in selected regions allow greater precision when desired, provided they've been carefully selected and calibrated. Incidentally, this also answers the question in the FAQ about having more or fewer grades. What people are generally thinking when they ask this is: I think there should be more precision (or less, though I wouldn't recommend less here). So this is one way to get the best of both worlds -- simplicity, plus precision when needed. Anyhow, I hope this background and explanation is useful, but fire away with questions if not. Cheers Holon ( talk) 11:07, 15 April 2008 (UTC)
After looking through the discussion here, I think it might be instructive to describe an 'ideal' process, and to work back from there to what's doable. Pretty much every issue that has come up here is fairly common in assessment, and I hope it will be easier to see why from the ideal. Please keep in mind that the work involved in what I outline below overlaps with normal work on articles anyway, and in the long run would likely make that work far easier by helping to identify what needs to be done and when.
It may also turn out that something closer to the ideal than I realize is achievable with the available skills. With some ingenuity, Wikipedia could be a first for online ratings en masse by developing a top-class process based on solid foundations! OK, not likely, but possible. It's already considerably better than the crude methods normally used, such as ratings of 1-10 plucked out of the air.
Given the nature of articles, the following is the process I would (and do) recommend in the ideal case.
This achieves two things:
The last point is key when articles are near a threshold for going from one "grade" to the next.
Probably more important than all else, an ordered set of examples provides a clear picture for editors of what it takes for an article to progress toward the highest standard.
A common reaction to this is that it's too time-consuming, because most people are used to easy, but poor, rating processes (e.g. plucking a number from 1 to 10 out of the air, or assigning a grade based on a best guess).
I understand, but my standard response is that the payoffs outweigh the up-front time, often by a large factor, and of course anything worth doing takes some effort and coordination. The only reason most of us can buy a thermometer and easily, yet precisely, measure temperature at will is that a lot of work lies behind its development and construction. Like anything else, including articles on Wikipedia themselves, quality products require some work.
Good measurement instruments and procedures are a cornerstone of industry and technology -- without common standards, many things would simply be impossible. The same idea applies to Wikipedia as a whole. If editors can quickly, yet precisely, measure against calibrated standards as they work and assess articles, there are similar payoffs: a lot more clarity on standards, on how to know where you are, and on what it takes to progress.
I believe around a million articles have been assessed; is that right?
However, like everything, it does take time and coordination. Hopefully, though, this helps in explaining the various issues and how they all fit together in the bigger picture, even if nobody actually ends up participating.
I can offer this to anyone who wishes to do a small-scale test in their own project. I don't think I have yet encountered a case in assessment where people have not found the process informative and useful.
Send me a set of article labels, preferably 15 or more, and I'll send back a spreadsheet with a set of pairwise comparisons to be done: each article to be compared with each of the others, and a judgment made about which is better. Do these and send me back the results. I will scale them, put them in order, and tell you how consistent you were overall and which articles were anomalous, if any. Include at least two or three of the articles in the scheme, so you will be able to see how the rest scale in between. If you can organize more than one judge to make comparisons, even better, and I can give you feedback on each judge's consistency and the agreement between them.
This should be quite quick for someone who is reasonably familiar with the set of articles, since the assessor only needs to refer to them when it's hard to say which is better. Most judgments should be quick, and only a portion take more time. The payoff: for your project, you get a much clearer picture of the way articles progress from worse to better quality, and you have a far more precise basis for judging when an article should move up a grade.
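As a minimal sketch of how judgments like these could be scaled: Holon doesn't name a particular model, but a Bradley-Terry fit via the classic Zermelo/Ford iteration is one standard way to turn "which of this pair is better?" judgments into an ordered scale. The article names and judgments below are purely illustrative, and the small prior is an assumption added here to keep the fit finite.

```python
from itertools import combinations

def bradley_terry(items, wins, n_iter=500, prior=0.5):
    """Fit Bradley-Terry strengths from pairwise judgments.

    wins[(a, b)] is how many times a was judged better than b.
    The prior adds pseudo-wins in both directions of every pair so
    the estimates stay finite when an article wins or loses all of
    its comparisons.
    """
    items = list(items)
    w = {}
    for a, b in combinations(items, 2):
        w[(a, b)] = prior + wins.get((a, b), 0)
        w[(b, a)] = prior + wins.get((b, a), 0)
    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        new = {}
        for i in items:
            # Zermelo/Ford update: wins_i / sum of n_ij / (s_i + s_j)
            wins_i = sum(w[(i, j)] for j in items if j != i)
            denom = sum((w[(i, j)] + w[(j, i)]) / (strength[i] + strength[j])
                        for j in items if j != i)
            new[i] = wins_i / denom
        norm = sum(new.values())
        strength = {i: s / norm for i, s in new.items()}
    return strength

# Hypothetical judgments: two exemplars from the scheme anchor the
# scale, and one candidate article is compared against both.
judgments = {
    ("Jammu and Kashmir", "Real analysis"): 1,  # B exemplar over Start exemplar
    ("Jammu and Kashmir", "Candidate"): 1,
    ("Candidate", "Real analysis"): 1,
}
articles = {"Jammu and Kashmir", "Real analysis", "Candidate"}
for name, s in sorted(bradley_terry(articles, judgments).items(),
                      key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```

In this toy run the candidate scales between the Start and B exemplars, which is exactly the borderline information the exercise is meant to provide; with real data, the pattern of inconsistent judgments would also indicate how reliable each judge was.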
This can be extended across projects: it would simply require choosing a number of articles from your project, as well as some from another project that is also doing a calibration exercise. All articles can be scaled jointly and tests conducted to see how successful the exercise was. It's preferable that the assessors have some knowledge of the other articles, but I doubt it would be necessary for them to be experts on the content to get worthwhile results.
Obviously, this requires coordination if it crosses editors, and particularly if it crosses projects. However, the result could be a nice cross-project list of articles, from lowest to highest quality, that everyone can refer to, plus the benefits to the project mentioned above.
So to reiterate, this process is beneficial for
I know there's a lot here, but I hope it gives a clear picture of the ideal, and it might spark ideas even if nobody elects to do a trial.
Don't hesitate to criticize -- believe me it's unlikely you'll raise anything I haven't heard many times, and if you do, I'll be grateful for the challenge.
Cheers all. Holon ( talk) 10:45, 11 May 2008 (UTC)
I noticed something about possibly including List-Class in the table, so here's something I dug up deep within WP:VG/A. -- .: Alex :. 21:00, 20 June 2008 (UTC)
| Label | Criteria | Reader's experience | Editor's experience | Example |
|---|---|---|---|---|
| FL {{FL-Class}} | Reserved exclusively for lists that have received "Featured list" status and meet the current criteria for featured lists. | Definitive. Outstanding, thorough list; a great source for encyclopedic information. | No further additions are necessary unless new published information has come to light, but further improvements to the text are often possible. | |
| List {{List-Class}} | Reserved exclusively for stand-alone lists, which are articles consisting of a lead section followed by a list. Articles with lists embedded within a small section of an article are prose articles and are not considered lists. | Useful to many, but not all, readers. The reader doing in-depth research may find insufficient information, or excessive information only useful to fans. | Any editing or additional material can be helpful. Considerable editing is needed to reach Featured list status; in particular, issues of breadth, completeness, and balance may need work. Peer review would be helpful at this stage. | |
Sounds like B-Class combined with Start-Class to me. However, it might be quickest for assessors (and editors) just to define it as 'not a Featured List'. -- Hroðulf (or Hrothulf) ( Talk) 23:10, 20 June 2008 (UTC)
So... how, if at all, is this going to progress? What do we have to work with, and work on? Happy‑ melon 18:27, 30 June 2008 (UTC)
This went live yesterday? It seems from the revision history that changes were still being made until late yesterday, and no-one mentioned here that it was considered complete. I don't think there are any big problems, as it doesn't deviate in substance from WPBIO or MILHIST's criteria, but it would have been nice to know that it was considered ready for launch. -- Hroðulf (or Hrothulf) ( Talk) 11:27, 5 July 2008 (UTC)
I agree with the discussion above that there should be a simplified version of the assessment scheme, even if just as a summary or supplement of the more complete and nuanced version. How about taking it to the extreme? A simple checklist could help give an overview at a glance without having to read (almost) anything! For example, something like the table below:
| Class | Completeness | NPOV | References | Figures/tables | Readability | MOS | Headings |
|---|---|---|---|---|---|---|---|
| (top class; label shown only as an icon in the original) | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| B | 75 | 75 | 75 | 50 | 85 | 75 | 100 |
| C | 50 | 50 | 25 | 20 | 70 | 25 | 66 |
| Start | 25 | 0 | 0 | 5 | 60 | 5 | 33 |
| Stub | 1 | 0 | 0 | 0 | 50 | 0 | 0 |
In this table, each number is a "semi-quantitative" percent progress towards achieving the goal for that column. Of course, the numbers are a bit fuzzy and subjective, and I just made them up, so they are certainly open to discussion. I'm just presenting this table as a proof of concept; the point is to find a way of showing how the various requirements change at different rates from one level to the next, and to provide a checklist for quick reference. Naturally, we would hope that the raters would also read the detailed descriptions, which would help in understanding qualitatively what "75%" means. :) -- Itub ( talk) 16:43, 1 July 2008 (UTC)
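To make the checklist idea concrete, here is a minimal sketch (in Python) of how such a rubric could be applied mechanically: given rough per-criterion percentages for an article, it returns the highest class whose row is satisfied in every column. The thresholds are copied from the table above; the top row's class is labelled only with an icon in the original, so it is called "Top" here, and the function name and sample scores are made up for illustration.

```python
CRITERIA = ("Completeness", "NPOV", "References", "Figures/tables",
            "Readability", "MOS", "Headings")

# Class -> minimum percent for each criterion, in CRITERIA order,
# listed from the highest class down (dicts keep insertion order).
THRESHOLDS = {
    "Top":   (100, 100, 100, 100, 100, 100, 100),
    "B":     (75, 75, 75, 50, 85, 75, 100),
    "C":     (50, 50, 25, 20, 70, 25, 66),
    "Start": (25, 0, 0, 5, 60, 5, 33),
    "Stub":  (1, 0, 0, 0, 50, 0, 0),
}

def suggest_class(scores):
    """scores: dict mapping criterion name -> estimated percent progress."""
    for cls, minima in THRESHOLDS.items():
        if all(scores.get(c, 0) >= m for c, m in zip(CRITERIA, minima)):
            return cls
    return "Stub"  # below even the Stub row

# A made-up article that clears every C threshold but falls short of
# several B thresholds:
print(suggest_class({"Completeness": 60, "NPOV": 80, "References": 30,
                     "Figures/tables": 25, "Readability": 72,
                     "MOS": 30, "Headings": 70}))  # -> C
```

Of course, as the paragraph above says, the numbers are fuzzy and subjective; the point of the sketch is only that a checklist of this shape can be read (or computed) at a glance.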
With the new B-class criteria (and overall upgrade of standards for B), should we not change "A well written B-class may correspond to the "Wikipedia 0.5" or "usable" standard" to something like "B-class corresponds to the "Wikipedia 0.5" or "usable" standard"? Or at the very least, "usually corresponds"?-- Father Goose ( talk) 20:40, 4 July 2008 (UTC)