A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Much of the existing Wikipedia research is based on the freely licensed datasets published by the Wikimedia Foundation: content dumps, pageview numbers, Clickstream samples, etc. But some individual researchers are giving back too. An example of this is the TokTrack dataset, described in an accompanying paper [1] as
Tracking authorship and provenance of Wikipedia article text is by no means a new topic (see e.g. meta:Research:Content persistence). However, the paper's authors assert that their method provides much higher accuracy than earlier efforts such as Wikitrust. One of them, Fabian Flöck, has been studying this problem with other researchers for years (cf. our coverage from 2012 and 2014: "Precise and efficient attribution of authorship of revisioned content", "Better authorship detection, and measuring inequality", "New algorithm provides better revert detection"; the present dataset is generated by their "Wikiwho" algorithm, which also underlies a browser extension called "Whocolor").
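To give a flavor of what token-level authorship tracking involves, here is a minimal sketch that attributes each token of the latest revision to the revision that introduced it, using simple sequence diffing. This is an illustrative simplification, not the actual Wikiwho algorithm, which additionally tracks tokens across reverts, deletions, and reintroductions:

```python
import difflib

def attribute_tokens(revisions):
    """Naively attribute each token of the latest revision to the
    index of the revision that introduced it.

    Simplified sketch only: unlike Wikiwho, this does not recognize
    a token that is deleted and later re-added as the same token.
    """
    owners = []       # (token, revision_index) pairs for current text
    prev_tokens = []
    for rev_idx, text in enumerate(revisions):
        tokens = text.split()
        matcher = difflib.SequenceMatcher(a=prev_tokens, b=tokens)
        new_owners = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                # Unchanged tokens keep their original attribution.
                new_owners.extend(owners[i1:i2])
            else:
                # Inserted or replaced tokens are credited to this revision.
                new_owners.extend((t, rev_idx) for t in tokens[j1:j2])
        owners = new_owners
        prev_tokens = tokens
    return owners

revs = ["the cat sat", "the black cat sat", "the black cat sat down"]
print(attribute_tokens(revs))
# [('the', 0), ('black', 1), ('cat', 0), ('sat', 0), ('down', 2)]
```

Even this toy version shows why the computation is heavy at scale: every revision of every article must be diffed against its predecessor at token granularity.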
What's more, the paper points out that "this data would be exceedingly hard to create by an average potential user" for the entire English Wikipedia due to the computational effort involved ("around 25 days on a dedicated Ubuntu Server [...] with 122 GB RAM and 20 cores"; for comparison, a community-created tool, "WikiBlame", which is linked from every revision history page on English Wikipedia, can take several minutes to find the provenance of an individual token in a single article).
After describing the dataset and the underlying methodology, the paper also briefly presents some insights that can be derived from it about the history of English Wikipedia. First, it looks at the number of added and surviving tokens over time, observing that
It highlights "a surprising spike in Oct. 2002 (also in absolute additions)". Although not mentioned in the paper, this is very likely the effect of bot contributions by User:Ram-Man of US geographical content. Figure 2(b) in the paper also seems to indicate that more than half of these October 2002 additions were still live 14 years later.
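The surviving-token analysis above boils down to checking whether a token added at one time is still present (or lasted past some window) at a later point. A minimal sketch, assuming a hypothetical data layout of per-token add and remove timestamps (the actual dataset records full change histories per token):

```python
from datetime import datetime, timedelta

def persisting_tokens(additions, removals, window=timedelta(hours=48)):
    """Return the set of tokens that 'persist': tokens never removed,
    or removed only after surviving at least `window` (the paper uses
    a 48-hour threshold). `additions` maps token_id -> add time;
    `removals` maps token_id -> remove time. Hypothetical layout
    for illustration only.
    """
    return {
        tok for tok, added in additions.items()
        if tok not in removals or removals[tok] - added >= window
    }

adds = {"t1": datetime(2016, 10, 1, 0, 0),
        "t2": datetime(2016, 10, 1, 0, 0)}
rems = {"t2": datetime(2016, 10, 1, 12, 0)}  # removed after only 12 hours
print(persisting_tokens(adds, rems))
# {'t1'}
```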
Analyzing the "persisting" tokens (that had not been removed within 48 hours) by user group, the authors observe:
The remainder of the paper uses the dataset to study editing controversies. First, the authors define two measures of how controversial an article is, both yielding evolution, Mustafa Kemal Atatürk and Bob Dylan (in that order) as the three most controversial articles as of October 2016 (based on the surviving content at that time only). They also find that "barneys" was the most conflicted string token.
Lastly, they examine the frequency of edits that undo other edits partially or totally, where the token-based data enables a more sophisticated approach than simpler types of revert analysis. They find that
However, they caution that since "content added by one revision can over (a long) time be corroded by many small changes [...] 'revert' cannot per se be equated with antagonism here, as these numbers include the complete spectrum from minor corrections to full-on opinion clashes and vandal fighting."
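The token-level approach to undo detection described above can be sketched as follows: a revision partially undoes an earlier revision when it removes tokens that the earlier revision added. The data layout here is hypothetical; the actual dataset records richer per-token histories:

```python
def find_undos(token_log):
    """Given a chronological log of (revision, action, token_id) tuples,
    with action 'add' or 'remove', return pairs (undoing_rev, undone_rev):
    a revision partially or fully undoes another when it removes tokens
    that the other revision added. Simplified sketch with an assumed
    data layout, for illustration only.
    """
    added_by = {}   # token_id -> revision that last added it
    undos = set()
    for rev, action, tok in token_log:
        if action == "add":
            added_by[tok] = rev
        elif action == "remove" and tok in added_by:
            undos.add((rev, added_by[tok]))
    return undos

log = [
    (1, "add", "t1"), (1, "add", "t2"),
    (2, "add", "t3"),
    (3, "remove", "t2"),  # revision 3 partially undoes revision 1
]
print(find_undos(log))
# {(3, 1)}
```

As the authors' caveat makes clear, a pair flagged this way may be anything from a typo fix to an edit war; the token data identifies the undo relation, not the intent behind it.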
See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.
Discuss this story