This is really interesting research, but is there any chance of differentiating long-term editing rates between the different reasons why articles get deleted? I'm assuming that we never convert spammers and rarely get authors of non-notable articles to shift to more notable subjects, but it would be great to know. Ϣere SpielChequers 17:02, 21 March 2011 (UTC)
In running some numbers, this report suggests we had about 1774 active editors in the 30 days prior to the report, of which roughly half would be admins. (This contradicts the official stats here [1], which say we should have about 3500 active editors.) Am I reading this wrong, or do we really have this few active editors? If so, current decline trends would put our project only a couple of years off from collapse.... Based on your average retention number, and looking at the number of editors we are projected to gain in the next year [2], we will gain 160 editors (100+ edits) in the next 12 months, while during the same period we would lose 212 editors, based on the average 12% annual loss we've had since 2007. That is a net loss of roughly 10% of editors annually, which would put us at zero editors in about 12 years, or half our current numbers in 6 years. I don't know a metric for the minimum number of editors needed to maintain the project, but I would take a stab and say we can't do with much less than 1000. I would love to explore this more. — Charles Edward ( Talk | Contribs) 17:58, 1 April 2011 (UTC)
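The projection in the comment above can be sketched in a few lines. The figures come from the comment itself; note that a compounding 10% annual net loss halves the editor base in about 7 years (not 6) and never quite reaches zero, so the 12-year figure presumably assumes a linear loss of the original base.

```python
start = 1774            # active editors in the prior 30 days, per the report
annual_net_loss = 0.10  # the ~10% net annual decline assumed in the comment

editors = float(start)
years = 0
while editors > start / 2:   # count years until the editor base halves
    editors *= 1 - annual_net_loss
    years += 1
print(years)  # a compound 10% decline halves the base in about 7 years
```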
I think a good metric to judge our minimum requirements would be to count up all the administrative actions (rollbacks, blocks, page protections, page patrols, etc.) over a certain period of time to determine the average number conducted daily. Then determine the average number of those actions done per existing editor. From those two figures, work out how many average editors are needed to carry out that amount of maintenance, and we will have a metric for the minimum number of editors required to keep the project from falling into disrepair. This would give an approximate threshold at which the project would begin a gradual collapse. I am not sure how to compile those numbers, though. :( — Charles Edward ( Talk | Contribs) 18:30, 1 April 2011 (UTC)
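Once the two averages are compiled, the threshold proposed here reduces to a single division. A minimal sketch with made-up inputs (the real values would have to come from the logs, as the comment notes):

```python
# Hypothetical averages; the real values would be compiled from the logs.
daily_maintenance_actions = 4000   # rollbacks, blocks, protections, patrols per day
actions_per_editor_per_day = 4     # average maintenance actions per active editor

# Editors needed just to keep pace with the daily maintenance load.
min_editors = daily_maintenance_actions / actions_per_editor_per_day
print(min_editors)
```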
Since the proposal is for requiring autoconfirmed status, it would also be of interest to have information on the number of users who created articles/redirects as part of their first ten edits. Also taking registration date into consideration would be even more encompassing, but I'm not sure of the feasibility. At least for the ten edits, can this be done? Cenarium ( talk) 21:40, 4 April 2011 (UTC)
I ran an updated version of the script a few days ago on February 2011 accounts (source code coming soon). It collected a lot more data, which is always nice, but unfortunately means it no longer lends itself to simple representations. Instead of looking at the first edit, it looked at the first 5 edits (I ran it before Cenarium's comment above). It recorded:
Remember that this is looking at 1-2 month retention rate, not 7 or so months as last time. A brief-ish summary of the results is below:
If there are any other combinations you'd like to see, let me know. For those with a toolserver account, the data is available on the s1-user server in the u_alexz_p database. If others are interested, I could potentially provide a dump of the table. I'll write a GUI frontend for this eventually. I plan to re-run the script each month to see if there's any change in long-term retention. Mr. Z-man 03:21, 9 April 2011 (UTC)
Rather than just presenting tables, interesting though they are, it's possible to do some statistical analysis on the data. It's important to do this, because there's a strong relationship between two of the variables that may predict whether new editors stay or go: creating a new article is much more likely to lead to a deletion than editing an existing page is.
One way of analysing the data is via "logistic regression". (The article Logistic regression seems, like a lot of mathematical articles, to be written entirely for mathematicians, and so is useless to most readers.) Applying logistic regression to this example (the data in the table in the article), we ask whether we can predict the likelihood of Staying (i.e. not being classified as Gone) from three potential predictor variables: Create (was the first action to create a page?), Deleted (was the page deleted?) and Main (was the first edit to a mainspace page?).
Step 1: I applied logistic regression using all three predictor variables (via this web page). The result showed, as is obvious from the data, that collectively all three predictor variables do predict whether users Stay or not (p < 0.0001). However, the third variable, Main, did not have a statistically significant impact (p = 0.40), i.e. it did not add to the prediction by the other two.
Step 2: So I removed Main, and used only two predictor variables, Create and Deleted. Both were statistically significant predictors of Stay at the 5% level (Create with p = 0.02, Deleted with p < 0.0001). (For the statistically minded, I also checked the interaction and it was non-significant.)
One way of looking at the size of the effect of the two predictor variables is to look at the "odds ratio". The odds ratio for Create is 1.16, i.e. holding constant the effect of Deleted, users whose first action is Create are predicted to be 16% more likely to stay (however the 95% confidence range on this is very wide, from 2% to 32% more likely to stay). The odds ratio for Deleted is 0.22, i.e. holding constant the effect of Create, users whose first edit is deleted are predicted to be 78% less likely to stay (the 95% confidence range on this is 72% to 83%).
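An odds ratio like the one quoted for Deleted can be computed by hand from a 2×2 table. The per-user data lives on the toolserver and isn't reproduced here, so the counts below are invented purely to illustrate the standard calculation (log odds ratio with a normal approximation for the 95% interval):

```python
import math

# Hypothetical 2x2 table of new users (these counts are made up):
#                      stayed   gone
# first page deleted       30    470
# not deleted             220    780
a, b = 30, 470    # deleted: stayed, gone
c, d = 220, 780   # not deleted: stayed, gone

odds_ratio = (a / b) / (c / d)          # odds of staying, deleted vs. not deleted
se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(odds_ratio)
lo95 = math.exp(math.log(odds_ratio) - 1.96 * se)
hi95 = math.exp(math.log(odds_ratio) + 1.96 * se)

print(round(odds_ratio, 2), round(lo95, 2), round(hi95, 2))
```

An odds ratio of about 0.22 reads as "78% lower odds of staying"; the regression estimate additionally holds Create constant, which a raw 2×2 table cannot do.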
So, perhaps not surprisingly, although there is a smallish positive effect of being allowed to create new articles, it's far outweighed by the negative effect of deletion.
But what we need to know is why there was a deletion. If it was because the new editor was trying to advertise or push some POV, then we should not be concerned if they never came back. If it was because of some technical failings in understanding Wikipedia policies, then we should be concerned, and should do something about it. Peter coxhead ( talk) 13:55, 12 April 2011 (UTC)
I was reviewing comments I made in the " autoconfirm to create" discussion, and came across the results page for your newusers script. I'm normally hands-off when it comes to userpages, but I took the liberty of formatting the latter section, simply to make the presentation more uniform. There are other, more minor changes as well (e.g. arranging numbers in descending order; punctuation, case... all the things I fuss over). At first, I just wanted to create an anchor to the results matrix, but it devolved into 45 minutes of me playing around with the tables and such. I hope this doesn't breach etiquette, and more importantly, doesn't interfere with further analysis or updates to your dataset. I'll be happy to undo the changes if they aren't to your liking. — VoxLuna ☾ orbit land 19:29, 12 April 2011 (UTC)
What does mainspace mean? 211.225.34.163 ( talk) 12:22, 24 April 2011 (UTC)