A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A paper in the Berkeley Technology Law Journal [1] finds that the traffic to privacy-sensitive articles on the English Wikipedia dropped significantly around June 2013, when the existence of the US government's PRISM online surveillance program was first revealed based on documents leaked by Edward Snowden. As stated by the author, Jon Penney, the study "is among the first to evidence—using either Wikipedia data or web traffic data more generally—how government surveillance and similar actions may impact online activities, including access to information and knowledge online." It received wide media attention upon its release, as already reported last year in the Signpost.
The paper is part of a growing body of literature that studies the effect of external events on Wikipedia pageviews (for another example, see our previous issue: "How does unemployment affect reading and editing Wikipedia? The impact of the Great Recession"). The 66-page paper stands out for its methodological diligence, devoting much space to explaining and justifying its data selection and statistical approach, and to checking the robustness of the results. The framework was adapted from an earlier MIT study that had similarly examined the effect of the Snowden revelations on Google search traffic for sensitive terms, finding a statistically significant reduction of 5%. The author emphasizes the higher quality of the Wikipedia data: "unlike Google Trends, the Wikimedia Foundation provides a wealth of data on key elements of its site, including article traffic data, which can provide a more accurate picture as to any impact or chilling effects identified."
To generate a list of Wikipedia articles that could be considered privacy-sensitive in the context of US government surveillance, the author used a (publicly available) set of terms that the Department of Homeland Security (DHS) specifies as related to terrorism. The corresponding Wikipedia articles (48 altogether) include dirty bomb, suicide attack, nuclear enrichment (a redirect) and eco-terrorism. To verify the assumption that Internet users indeed regard these topics as privacy-sensitive, a survey of 415 Mechanical Turk users asked them to rate each topic, e.g. on whether they would be likely to delete their browser history after accessing information about it.
To examine the impact on traffic, the paper uses the time series of monthly pageviews for the 48 articles (81 million views altogether, from January 2012 to August 2014). It is divided into the periods before and after the June 2013 "exogenous shock". As a first finding, the author notes that the average monthly views in the "after" period are lower, but points out that such considerations (which e.g. form part of the difference-in-differences approach in the paper on unemployment mentioned above) are too simplistic to show an actual effect, e.g. because the drop could merely be caused by an overall declining traffic trend. (Although not stated directly in the paper, this is indeed the case, as the study is based only on desktop pageviews, which have been gradually replaced by mobile views in recent years. The Wikimedia Foundation makes combined mobile/desktop pageview datasets available going back to 2015.)
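The naive before/after comparison the author cautions against can be sketched in a few lines (illustrative numbers only, not the paper's data); a lower "after" mean alone cannot distinguish a chilling effect from a pre-existing downward trend:

```python
# Naive before/after comparison of mean monthly views.
# The numbers below are made up for illustration.
before = [120, 118, 121, 119, 122, 120]   # monthly views (thousands), pre-revelation
after  = [110, 108, 109, 107, 108, 106]   # monthly views (thousands), post-revelation

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)
drop_pct = 100 * (1 - mean_after / mean_before)

print(f"mean before: {mean_before:.1f}, mean after: {mean_after:.1f}, "
      f"drop: {drop_pct:.1f}%")
```

The comparison says nothing about whether views were already falling before the shock, which is exactly why the paper moves on to a trend-aware method.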
The author then turns to a more sophisticated statistical method known as interrupted time series analysis (ITS). It involves a "segmented regression analysis": linear trend lines are calculated separately for the timespans before and after June 2013, providing information both on the slope (growth/decrease rate) within each segment and on the size of the discontinuity (if any) where the two segments meet at the break point. This method indicates "an immediate drop-off of over 30% of overall views" following the June 2013 revelations. To further exclude the possibility that the results for these terrorism-related articles "may simply reflect overall Wikipedia article view traffic trends", an analogous ITS analysis is conducted for the pageviews of all Wikipedia articles.
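A generic segmented regression of this kind can be sketched as follows (a minimal illustration on synthetic data, not the paper's exact model or specification): the coefficient on the post-interruption dummy estimates the immediate level drop, and the interaction term the change in slope.

```python
import numpy as np

# Interrupted time series via segmented regression (generic sketch).
# Model: y_t = b0 + b1*t + b2*D_t + b3*(t - t0)*D_t + e_t
# where D_t = 1 for months at or after the interruption month t0.
# b2 is the immediate level change; b3 the change in slope.

def fit_its(y, t0):
    t = np.arange(len(y))
    D = (t >= t0).astype(float)
    X = np.column_stack([np.ones(len(y)), t, D, (t - t0) * D])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [intercept, pre-slope, level change, slope change]

# Synthetic series: mild upward trend, then a sharp level drop at month 17
rng = np.random.default_rng(0)
t = np.arange(32)
y = 100 + 0.5 * t - 30 * (t >= 17) + rng.normal(0, 1, 32)

b0, b1, b2, b3 = fit_its(y, 17)
print(f"estimated immediate level change: {b2:.1f}")  # close to -30
```

Fitting both segments jointly like this is what lets the method separate a one-off drop at the break point from whatever trend was already underway.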
The author points out the importance of the results for the Wikimedia Foundation's current lawsuit that challenges the constitutionality of the NSA surveillance of Internet traffic.
See also our review of a recent qualitative study that examined the privacy concerns of editors: "Privacy, anonymity, and perceived risk in open collaboration: a study of Tor users and Wikipedians".
See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.