A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Wikipedia pages are edited with varying levels of consistency: stubs may only have a dozen or fewer revisions and controversial topics might have more than 10,000 revisions. However, this editing activity is not evenly spaced out over time either: some revisions occur in very quick succession while other revisions might persist for weeks or months before another change is made. Many social and technical systems exhibit " bursty" qualities of intensive activity separated by long periods of inactivity. In a pre-print submitted to arXiv, a team of physicists at the Belgian Université de Namur and Portuguese University of Coimbra examine this phenomenon of "burstiness" in editing activity on the English Wikipedia. [1]
The authors use a database dump containing the revision history until January 2010 of 4.6 million English Wikipedia pages. Filtering out pages and editors with fewer than 2000 revisions, bots, and edits from unregistered accounts, the paper adopts some previously-defined measures of burstiness and cyclicality in these editing patterns. The measures of editors' revisions' burstiness and memory fall outside of the limits found in prior work about human dynamics, suggesting different mechanisms are at work on Wikipedia editing than in mobile phone communication, for example.
Using a fast Fourier transform, the paper finds the 100 most active editors have signals occurring at a 24-hour frequency (and associated harmonics) indicating they follow a circadian pattern of revising daily as well as differences by day of week and hour of day. However, the 100 most-revised pages lack a similar peak in the power spectrum: there is no characteristic hourly, daily, weekly, etc. revision pattern. Despite these circadian patterns, editors' revision histories still show bursty patterns with long-tailed inter-event times across different time windows.
The paper concludes by arguing, "before performing an action, we must overcome a “barrier”, acting as a cost, which depends, among many other things, on the time of day. However, once that “barrier” has been crossed, the time taken by that activity no longer depends on the time of day at which we decided to perform it. ... It could be related to some sort of queuing process, but we prefer to see it as due to resource allocation (attention, time, energy), which exhibits a broad distribution: shorter activities are more likely to be executed next than the longer ones."
Google Trends is widely used in academic research to model the relationship between information seeking and other social and behavioral phenomenon. However, Wikipedia pageview data can provide a superior – if underused – alternative that has attracted some attention for public health and economic modeling, but not to the same extent as Google Trends. The authors cite the relative openness of Wikipedia pageview data, the semantic disambiguation, and absolute counts of activity in contrast to Google Trends' closed API, semantic ambiguity of keywords, and relative query share data. However, Trends data (at a weekly level) does go back to 2004, while pageview data (at an hourly level) is only available from 2008.
In a peer-reviewed paper published by PLoS ONE, a team of physicists perform a variety of time series analyses to evaluate changes in attention around the "big data" topic of Hadoop. [2] Defining two key constructs of relevance and representation based on the interlanguage links as well as hyperlinks to/from other concepts, they examine changes in these features over time. In particular, changes in the articles' content and attention occurred in concert with the release of new versions and the adoption of the technology by new firms.
The time series analyses (and terms used to refer to them) will be difficult for non-statisticians to follow, but the paper makes several promising contributions. First, it provides a number of good critiques of research relying exclusive on Google Trends data (outlined above). Second, it provides some methods for incorporating behavioral data from strongly related topics and examining these changes over time in a principled manner. Third, the paper examines behavior across multiple languages editions rather than focusing solely on the English Wikipedia. The paper points to ways in which Wikipedia is an important information sources for tracking publication and recognition of new topics.
This paper [3] data mines Wikipedia's biographies, focusing on individuals' longevity, profession and cause of death. The authors are not the first to observe that the majority of Wikipedia biographies are about sportspeople (half of them soccer players), followed by artists and politicians. But they do make some interesting historical observations, such as that the sport rises only in the 20th century (particularly from the 1990s), that politics surpassed religion in the 13th century, until it was surpassed by sport, and so on. The authors divide the biographies into public (politicians, businessmen, religion) and private (artists and sportspeople) and note that it was only in the last few decades that the second group started to significantly outnumber the first; they conclude that this represents a major shift in societal values, which they refer to as "hidden revolution in human priorities". It is an interesting argument, though the paper is unfortunately completely missing the discussion of some important topics, such as the possible bias introduced by Wikipedia's notability policies.
This paper [4] looks into gender inequalities in Wikipedia articles, presenting a computational method for assessing gender bias in Wikipedia along several dimensions. It touches on a number of interesting questions, such as whether the same rules are used to determine whether women and men are notable; whether there is linguistic bias, and whether articles about men and women have similar structural properties (e. g., similar meta-data, and network properties in the hyperlink network).
They conclude that notability guidelines seem to be more strictly enforced for women than for men, that linguistic bias exists (ex. one of the four words most strongly associated with female biographies is "husband", whereas such family-oriented words are much less likely to be found in biographies of male subjects), and that as the majority of biographies are about men and men tend to link more to men than to women, this lowers visibility of female biographies (for example, in search engines like Google). The authors suggest that Wikipedia community should consider lowering notability requirements for women (controversial), and adding gender-neutral language requirements to the Manual of Style (a much more sensible proposal).
A survey [5] of 372 anesthesists and critical care providers in Austria and Australia found that "In order to get a fast overview about a medical problem, physicians would prefer Google (32%) over Wikipedia (19%), UpToDate (18%), or PubMed (17%). 39% would, at least sometimes, base their medical decisions on non peer-reviewed resources. Wikipedia is used often or sometimes by 77% of the interns, 74% of residents, and 65% of consultants to get a fast overview of a medical problem. Consulting Wikipedia or Google first in order to get more information about the pathophysiology, drug dosage, or diagnostic options in a rare medical condition was the choice of 66%, 10% or 34%, respectively." (A 2012 literature review found that "Wikipedia is widely used as a reference tool" among clinicians.)
A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Wikipedia pages are edited with varying levels of consistency: stubs may only have a dozen or fewer revisions and controversial topics might have more than 10,000 revisions. However, this editing activity is not evenly spaced out over time either: some revisions occur in very quick succession while other revisions might persist for weeks or months before another change is made. Many social and technical systems exhibit " bursty" qualities of intensive activity separated by long periods of inactivity. In a pre-print submitted to arXiv, a team of physicists at the Belgian Université de Namur and Portuguese University of Coimbra examine this phenomenon of "burstiness" in editing activity on the English Wikipedia. [1]
The authors use a database dump containing the revision history until January 2010 of 4.6 million English Wikipedia pages. Filtering out pages and editors with fewer than 2000 revisions, bots, and edits from unregistered accounts, the paper adopts some previously-defined measures of burstiness and cyclicality in these editing patterns. The measures of editors' revisions' burstiness and memory fall outside of the limits found in prior work about human dynamics, suggesting different mechanisms are at work on Wikipedia editing than in mobile phone communication, for example.
Using a fast Fourier transform, the paper finds the 100 most active editors have signals occurring at a 24-hour frequency (and associated harmonics) indicating they follow a circadian pattern of revising daily as well as differences by day of week and hour of day. However, the 100 most-revised pages lack a similar peak in the power spectrum: there is no characteristic hourly, daily, weekly, etc. revision pattern. Despite these circadian patterns, editors' revision histories still show bursty patterns with long-tailed inter-event times across different time windows.
The paper concludes by arguing, "before performing an action, we must overcome a “barrier”, acting as a cost, which depends, among many other things, on the time of day. However, once that “barrier” has been crossed, the time taken by that activity no longer depends on the time of day at which we decided to perform it. ... It could be related to some sort of queuing process, but we prefer to see it as due to resource allocation (attention, time, energy), which exhibits a broad distribution: shorter activities are more likely to be executed next than the longer ones."
Google Trends is widely used in academic research to model the relationship between information seeking and other social and behavioral phenomenon. However, Wikipedia pageview data can provide a superior – if underused – alternative that has attracted some attention for public health and economic modeling, but not to the same extent as Google Trends. The authors cite the relative openness of Wikipedia pageview data, the semantic disambiguation, and absolute counts of activity in contrast to Google Trends' closed API, semantic ambiguity of keywords, and relative query share data. However, Trends data (at a weekly level) does go back to 2004, while pageview data (at an hourly level) is only available from 2008.
In a peer-reviewed paper published by PLoS ONE, a team of physicists perform a variety of time series analyses to evaluate changes in attention around the "big data" topic of Hadoop. [2] Defining two key constructs of relevance and representation based on the interlanguage links as well as hyperlinks to/from other concepts, they examine changes in these features over time. In particular, changes in the articles' content and attention occurred in concert with the release of new versions and the adoption of the technology by new firms.
The time series analyses (and terms used to refer to them) will be difficult for non-statisticians to follow, but the paper makes several promising contributions. First, it provides a number of good critiques of research relying exclusive on Google Trends data (outlined above). Second, it provides some methods for incorporating behavioral data from strongly related topics and examining these changes over time in a principled manner. Third, the paper examines behavior across multiple languages editions rather than focusing solely on the English Wikipedia. The paper points to ways in which Wikipedia is an important information sources for tracking publication and recognition of new topics.
This paper [3] data mines Wikipedia's biographies, focusing on individuals' longevity, profession and cause of death. The authors are not the first to observe that the majority of Wikipedia biographies are about sportspeople (half of them soccer players), followed by artists and politicians. But they do make some interesting historical observations, such as that the sport rises only in the 20th century (particularly from the 1990s), that politics surpassed religion in the 13th century, until it was surpassed by sport, and so on. The authors divide the biographies into public (politicians, businessmen, religion) and private (artists and sportspeople) and note that it was only in the last few decades that the second group started to significantly outnumber the first; they conclude that this represents a major shift in societal values, which they refer to as "hidden revolution in human priorities". It is an interesting argument, though the paper is unfortunately completely missing the discussion of some important topics, such as the possible bias introduced by Wikipedia's notability policies.
This paper [4] looks into gender inequalities in Wikipedia articles, presenting a computational method for assessing gender bias in Wikipedia along several dimensions. It touches on a number of interesting questions, such as whether the same rules are used to determine whether women and men are notable; whether there is linguistic bias, and whether articles about men and women have similar structural properties (e. g., similar meta-data, and network properties in the hyperlink network).
They conclude that notability guidelines seem to be more strictly enforced for women than for men, that linguistic bias exists (ex. one of the four words most strongly associated with female biographies is "husband", whereas such family-oriented words are much less likely to be found in biographies of male subjects), and that as the majority of biographies are about men and men tend to link more to men than to women, this lowers visibility of female biographies (for example, in search engines like Google). The authors suggest that Wikipedia community should consider lowering notability requirements for women (controversial), and adding gender-neutral language requirements to the Manual of Style (a much more sensible proposal).
A survey [5] of 372 anesthesists and critical care providers in Austria and Australia found that "In order to get a fast overview about a medical problem, physicians would prefer Google (32%) over Wikipedia (19%), UpToDate (18%), or PubMed (17%). 39% would, at least sometimes, base their medical decisions on non peer-reviewed resources. Wikipedia is used often or sometimes by 77% of the interns, 74% of residents, and 65% of consultants to get a fast overview of a medical problem. Consulting Wikipedia or Google first in order to get more information about the pathophysiology, drug dosage, or diagnostic options in a rare medical condition was the choice of 66%, 10% or 34%, respectively." (A 2012 literature review found that "Wikipedia is widely used as a reference tool" among clinicians.)
A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.
Discuss this story