A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Finn Årup Nielsen, Michael Etter and Lars Kai Hansen presented a technical report [1] on an online service they created for real-time monitoring of Wikipedia articles about companies. It performs sentiment analysis of edits, which can be filtered by company and by editor. Sentiment analysis is a relatively new applied linguistics technology used in tasks ranging from author profiling to detecting fake reviews on online retail sites. The visualization provided by this tool makes it easy to spot deviations from linguistic neutrality. However, as the authors point out, this analysis only gives a robust picture when used statistically, and is more prone to mistakes when operating within a limited scope.
The service monitors recent changes using an IRC stream and detects company-related articles from a small hand-built list. It then retrieves the current version using the MediaWiki API and performs sentiment analysis using the AFINN sentiment-annotated word list. The project was developed by integrating a number of open source components such as NLTK and CouchDB. Unfortunately, the source code has not been made available and the service can only run queries on the shortlisted companies, which will limit the impact of this report on future Wikipedia research. However, it seems to have potential as a tool for detecting conflict-of-interest (COI) edits, which tend to upset neutrality either by adding excessive praise or by adding attacks that tip the content in the other direction. We hope the researchers will open-source this tool, as they did with their prior work on the AFINN data set, or at least provide some UI to query articles not included in the original research.
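The scoring step described above can be illustrated with a minimal sketch. It assumes a word-list approach in the style of AFINN, where each listed word carries an integer valence and a text's score is the sum over its words. The lexicon entries, the function names and the before/after comparison of an edit are assumptions made for illustration; they are not taken from the report or from the AFINN list itself.

```python
import re

# Toy valence lexicon in the style of AFINN (integer scores, negative to positive).
# These entries are illustrative assumptions, not values copied from AFINN.
TOY_LEXICON = {
    "excellent": 3,
    "innovative": 2,
    "leading": 1,
    "scandal": -3,
    "fraud": -4,
    "poor": -2,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every token that appears in the lexicon."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def edit_sentiment_delta(old_text, new_text, lexicon=TOY_LEXICON):
    """Sentiment change introduced by an edit: new score minus old score."""
    return sentiment_score(new_text, lexicon) - sentiment_score(old_text, lexicon)


old = "Acme is a company that makes widgets."
new = "Acme is an innovative, leading company that makes excellent widgets."
print(edit_sentiment_delta(old, new))
```

A single edit's delta is noisy, which matches the authors' caveat that the analysis is only robust when aggregated over many edits.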
A paper [2] with this title investigates the relation between the scientific reputation of scientific items (authors, papers, and keywords) and the impact of the same items on Wikipedia articles. The sample of scientific items consists of the entries in the ACM digital library, including more than 100 k papers, 150 k authors and 35 k keywords. However, only a tiny subset of these could be found in English Wikipedia pages (the authors considered all pages in the English edition which contain at least two mentions of any of the scientific items in the sample). The academic reputation is calculated based on three criteria: frequency of appearance, number of citations each item receives from the others, and PageRank calculated on the citation network. The Wikipedia ranking is based on three popularity measures of all the pages that mention the item: number of mentions, the sum of the PageRank of all the mentioning pages, and the sum of the in-degrees of all the mentioning pages in Wikipedia's hyperlink network.
These 3 times 3 choices give 9 combinations of academic ranking and Wikipedia ranking for each of the 3 types of scientific entities (authors, papers, keywords). All of these 27 pairs are shown to be correlated according to Spearman's rank correlation, indicating that in general Wikipedia mentions are non-randomly driven by scientific reputation. However, for most of the combinations the correlation is less significant. Surprisingly, the most relevant Wikipedia ranking criterion turns out to be the plain total number of mentions, rather than the more sophisticated PageRank and in-degree measures.
In a separate part, the authors define two sets of scientific items: those which are mentioned in Wikipedia, and those which are not mentioned at all (the latter is larger by a factor of 2 for keywords, 100 for authors, and 300 for papers). They show that for all 3 types, the set of items which are mentioned in Wikipedia has a better academic rank on average.
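As a reminder of what Spearman's rank correlation measures, here is a minimal pure-Python sketch: both score lists are converted to ranks (ties receive the mean of their positions), and the Pearson correlation of the ranks is returned. The sample numbers and function names are hypothetical and are not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical items: an academic score (citations) and Wikipedia mention counts.
citations = [120, 45, 300, 10, 80]
mentions = [7, 3, 9, 1, 2]
print(round(spearman(citations, mentions), 3))
```

Because only ranks enter the computation, the measure is insensitive to the very different scales of citation counts, PageRank sums and mention counts.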
"The early history of Cambodia is represented by an extremely weak article, but there is an improvement in the articles dealing with the early kingdoms of Cambodia. The improvement ends abruptly with articles on the 'dark age' of Cambodia, the French Protectorate, the Japanese occupation, and early postindependence periods being of a much lower quality. Afterward, the quality picks up again with especially good articles on the American intervention in Cambodia, the Cambodian-Vietnamese War, and the People's Republic of Kampuchea. However, the quality does not last; as we near contemporary times, the articles take another turn for the worse."
From this, the author concludes that "the Wikipedia community is unconsciously mimicking the general historiography of the country", in particular a glorification of Angkor and other early kingdoms at the cost of later periods, and observes a "continuing dominance of the traditional historiographical narrative of Cambodian history in Wikipedia." The subsequent section of the paper tries to put these results into the context of the historical debates in the late 1970s and early 1980s about the New World Information and Communication Order (NWICO), a suggested remedy for problems with the under-representation of the developing world in the media, put forth by a UNESCO commission in the MacBride report (1980):
"Wikipedia provides access—it is free to use by anyone with an Internet connection, and print versions can also be distributed. But the whole thrust of the NWICO argument is that content matters and those who create content matter perhaps even more, with the commission stressing that countries needed to 'achieve self-reliance in communication capacities and policies' ... Contrary to popular belief, in the new 'information age' content is, once again, the preserve of the few, not the many, and a geographically concentrated few at that."
The author's argument is somewhat weakened by the erroneous assertion that "there exists no Cambodian-language Wikipedia", but generally aligns with other quantitative research that has found a geographic unevenness of coverage in Wikipedia. The author is an information studies professor at Singapore's Nanyang Technological University and previously published a related paper in the same journal examining the Wikipedia article History of the Philippines, reviewed in the August issue: "The limits of amateur NPOV history".
Julia Preusse, Jerome Kunegis, Matthias Thimm, Thomas Gottron and Steffen Staab investigate [4] mechanisms of change in a wiki that are of a structural nature, i.e., which are a direct result of the wiki's linking structure. They consider whether the addition and removal of internal links between pages can be predicted using only information about the network connecting these articles. The study's innovation lies in considering the removal of links, which accounts for a high proportion of removals and reverts. The authors performed an empirical study on Wikipedia, stating that traditional indicators of structural change used in the link analysis literature can be classified into four classes, which indicate growth, decay, stability and instability of links. These methods were then employed to identify the underlying reasons for individual additions and removals of knowledge links.
The network created by links between articles in Wikipedia is characterized by preferential attachment. Prior work on social networks has identified a phenomenon called "liability of newness", in which new connections are more likely to be broken than older ones. To provide a better predictive model of link evolution the team considered five hypotheses:
To test these hypotheses, they created networks based on the history, up to 2011, of the mainspace articles of the top five Wikipedias after the English one. For example, in the French Wikipedia, 41.7 million links were added and 17.3 million removed during that time. The data was used to create a link creation predictor and a link removal predictor. These were then evaluated using the area under the receiver operating characteristic curve (AUC).
The results showed that preferential attachment and embedding are good indicators of growth. Liability of newness turned out not to be a good indicator of link removal, but rather of article instability. Reciprocity is also an indicator of growth, but is less significant since most links in a wiki are not reciprocated.
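The evaluation idea can be sketched as follows. The example scores candidate links with a degree-product preferential attachment score, a common formulation in the link prediction literature that is assumed here and not confirmed as the paper's exact indicator. It then computes the AUC as the probability that a link which was later added outscores one which was not. The toy graph and all names are hypothetical.

```python
from itertools import product


def degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def preferential_attachment_score(deg, u, v):
    """Degree-product score: well-connected pages are assumed to attract links."""
    return deg.get(u, 0) * deg.get(v, 0)


def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p, n in product(positive_scores, negative_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical snapshot of a tiny wiki link network.
snapshot = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "F")]
deg = degrees(snapshot)

added_later = [("B", "D"), ("A", "E")]   # links that appeared afterwards
never_added = [("C", "E"), ("D", "F")]   # pairs that stayed unlinked

pos = [preferential_attachment_score(deg, u, v) for u, v in added_later]
neg = [preferential_attachment_score(deg, u, v) for u, v in never_added]
print(auc(pos, neg))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking. The same function can score a removal predictor by treating removed links as the positive class.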
An article [5] in the Journal of Information Science, titled "Understanding trust formation in digital information sources: The case of Wikipedia", explores the criteria used by students to evaluate the credibility of Wikipedia articles. It contains an overview of various earlier studies about credibility judgments of Wikipedia articles (some of them reviewed previously in this space, for example "Quality of featured articles doesn't always impress readers").
The authors asked "20 second-year undergraduate students and 30 Master’s students" in information studies to first spend 20 minutes reading "a copy of a two-page Wikipedia article on Generation Z, a topic with which students were expected to have some familiarity", and to answer an open-ended question explaining how they would judge its trustworthiness. In a subsequent part, the respondents were asked to rank a list of factors for trustworthiness in the case of "either (a) the topic of an assignment, or (b) a minor medical condition from which they were suffering". One of the first findings was a "low pre-disposition to use [Wikipedia], possibly suggesting a propensity to distrust, grounded on debates and comments on the trustworthiness of Wikipedia" – possibly due to the fact that the example article contained an instance of vandalism, which was highlighted by several respondents (e.g. "started off as a valid entry ... due to citations strengthening this ... however came to the last paragraph and the whole document was marred by the insert of 'writing articles on Wikipedia while on amphetamines' [as purported hobby of Generation Z members]... just feels that you can't trust anything now").
Among the given trustworthiness factors, the following were ranked most highly:
"authorship, currency, references, expert recommendation and triangulation/verification, with usefulness just below this threshold. In other words, participants valued having articles that were written by experts on the subject, that were up to date, and that they perceived to be useful (content factors). ... Interestingly these factors all seemed more or less equally important for both contexts, with the exception of references, which for predictable reasons were seen as having greater importance in the context of assignments."
In a conference paper titled "Analyzing the flow of ideas and profiles of contributors in an open learning community" [6] (see also audience notes from the presentation), the authors construct a graph from the revisions of a set of Wikiversity pages, with two kinds of edges: 1) "Update edges", linking a page's revision to the directly subsequent revision. These are understood as representing "knowledge flow over the course of the collaborative process on a single wiki page". 2) "Hyperlink edges" between two revisions of different pages with a wikilink between them, but pointing in the opposite direction, because the idea is that they indicate knowledge flowing from the linked page to the linking page. By defining the source node of a hyperlink edge "as the latest revision of the hyperlinked page at the moment of creation of the target revision", both kinds of links point forward in time. After filtering out "redundant" hyperlink edges and attaching authorship information to each node (page revision), the result is a two-relational directed acyclic graph (DAG), which is "depicting the knowledge flow over time."
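The graph construction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Revision` record, its field names and the toy data are hypothetical, and the filtering of "redundant" hyperlink edges is omitted because its definition is not given here.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical record for one page revision; "links" holds the titles it links to.
Revision = namedtuple("Revision", "page rev_id timestamp author links")


def build_knowledge_flow_graph(revisions):
    """Return (update_edges, hyperlink_edges, authors) for a set of revisions."""
    by_page = {}
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        by_page.setdefault(rev.page, []).append(rev)

    update_edges, hyperlink_edges = [], []
    authors = {rev.rev_id: rev.author for rev in revisions}

    for revs in by_page.values():
        # Update edges: each revision points to the directly subsequent one.
        for prev, curr in zip(revs, revs[1:]):
            update_edges.append((prev.rev_id, curr.rev_id))
        # Hyperlink edges: from the latest earlier revision of the linked page
        # to the linking revision, so that every edge points forward in time.
        for rev in revs:
            for linked_page in rev.links:
                if linked_page == rev.page:
                    continue  # ignore self-links
                candidates = by_page.get(linked_page, [])
                times = [c.timestamp for c in candidates]
                earlier = bisect_left(times, rev.timestamp)  # count strictly earlier
                if earlier > 0:
                    hyperlink_edges.append((candidates[earlier - 1].rev_id, rev.rev_id))
    return update_edges, hyperlink_edges, authors


revs = [
    Revision("Anatomy", "a1", 1, "alice", frozenset()),
    Revision("History Taking", "h1", 2, "bob", frozenset({"Anatomy"})),
    Revision("Anatomy", "a2", 3, "alice", frozenset()),
    Revision("History Taking", "h2", 4, "carol", frozenset({"Anatomy"})),
]
updates, hyperlinks, authors = build_knowledge_flow_graph(revs)
print(sorted(updates))
print(sorted(hyperlinks))
```

Since every edge runs from an earlier revision to a later one, the graph cannot contain a cycle as long as timestamps are distinct.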
The authors apply this procedure to a set of Wikiversity articles in the area of medicine, starting with v:Gynecological History Taking. The results are interpreted as follows:
"[...] the beginning, short after the category medicine was founded, the authors in this category built up the basic structure of the knowledge domain. The main relations and idea flows between the learning materials were established early in the development of the domain. After that the authors have been focusing on elaborating the articles without introducing new important hyperlinks. The overall picture of the learning process in this domain suggests a divergent evolution of ideas after an initial period of mutual fertilization between different topics. This conforms to the idea of groups of learners that followed different interests in the medicine domain with little inter-group collaboration on the creation of new shared learning resource."
The method is subsequently applied to profile the activities of various users.
The authors have integrated these algorithms, including visualization tools, into a "network analytics workbench ... used in the ongoing EU project SISOB which aims to measure the influence of science on society based on the analysis of (social) networks of researchers and created artifacts."
Discuss this story
Font size
According to the above, four researchers from Barcelona "recommend using 18-point font size when designing web text for readers with dyslexia". Can't dyslexics "zoom" their displays, like the rest of us do? -- Orlady (talk) 15:08, 2 May 2013 (UTC)
Opinionated
As usual, good work summarizing briefly a number of interesting activities. However, a minor grammar error brought me to a heightened state of alert, which made me notice the poorer quality of the next "Mining content removed" item. It's too long for an "in brief" bullet point, because the reviewer spends too many words pointing out what's wrong with the reviewed work. Other than that, it's a well written page, rewarding the usual wait for our overdue Signpost. Jim.henderson (talk) 12:25, 3 May 2013 (UTC)
Usability study
To increase readability, I recommend a line width of 120 characters. You can do that with this code, copy it on your userspace. Have fun! -- NaBUru38 (talk) 17:32, 3 May 2013 (UTC)
Provenance
The provenance of Wikipedia articles is per-character, but the W3C PROV descriptor is per-document, isn't it? 116.233.70.143 (talk) 01:00, 5 May 2013 (UTC)