This is a draft of an article I intend to publish on Wikipedia soon. I am using this page in my user space to prepare it before submitting what is, from my point of view, the final version for review.
Original author(s) | OSR Group |
---|---|
Initial release | May 1, 2011[1] |
Written in | Java |
Operating system | Cross-platform |
Type | Parser |
Website | sweble |
The Sweble Wikitext parser [3] is an open-source tool for parsing the Wikitext markup language used by MediaWiki, the software behind Wikipedia. The initial development was done by Hannes Dohrn as a Ph.D. thesis project at the Open Source Research Group of Professor Dirk Riehle at the University of Erlangen-Nuremberg from 2009 until 2011. The results were presented to the public for the first time at the WikiSym conference in 2011 [4].
Based on the statistics at Ohloh [5], the parser is written mainly in the Java programming language. It was open-sourced in May 2011 [1]. The parser itself is generated from a parsing expression grammar (PEG) using the Rats! parser generator. Encoding validation is done by a lexical analyser generated with JFlex, the Java counterpart of flex.
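As a rough illustration of what an encoding-validation stage does in principle, the following simplified sketch (not Sweble's actual code; the method name and the choice of replacement character are assumptions made for illustration) scans the input for control characters that are not allowed in Wikitext and replaces them, so that later stages only ever see well-formed text:

```java
// Hypothetical illustration of an encoding-validation pass; not Sweble's actual code.
public final class EncodingValidationSketch {

    // Replace characters that are illegal in the input (here: ISO control
    // characters other than tab, carriage return, and newline) with U+FFFD,
    // so that later parser stages can rely on clean input.
    public static String validateEncoding(String wikitext) {
        StringBuilder clean = new StringBuilder(wikitext.length());
        for (int i = 0; i < wikitext.length(); ) {
            int cp = wikitext.codePointAt(i);
            boolean illegal = Character.isISOControl(cp)
                    && cp != '\n' && cp != '\r' && cp != '\t';
            clean.appendCodePoint(illegal ? 0xFFFD : cp);
            i += Character.charCount(cp);
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        // The bell character (U+0007) is replaced, the rest is kept as-is.
        System.out.println(validateEncoding("== Heading ==\u0007\nText"));
    }
}
```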
A preprint of the paper on the design of the Sweble Wikitext parser can be found at the project's homepage. [6] In addition, a summary page exists on MediaWiki's future pages. [7]
The parser used in MediaWiki converts the content directly from Wikitext into HTML. This process is done in two stages [8]:
As the authors of Sweble write in their paper [6], an analysis of the source code of MediaWiki's parser showed that the strategy of using separate transformation steps leads to new problems: most of the functions used do not take the scope of the surrounding elements into account. This leads to incorrect nesting in the resulting HTML output, so its interpretation and rendering can be ambiguous and depend on the rendering engine of the web browser used. They state: "The individual processing steps often lead to unexpected and inconsistent behavior of the parser. For example, lists are recognized inside table cells. However, if the table itself appears inside a framed image, lists are not recognized." [6]
As argued at the WikiSym conference in 2008, a lack of language precision and component decoupling hinders the evolution of wiki software. If wiki content had a well-specified representation that is fully machine-processable, this would not only lead to better accessibility of its content but also improve and extend the ways in which it can be processed. [9]
In addition, a well-defined object model for wiki content would allow further tools to operate on it. There have been numerous attempts at implementing a new parser for MediaWiki (see [1]), but none of them has succeeded so far. The authors of Sweble suggest that this might be due to the choice of grammar in those attempts, namely the well-known LALR(1) and LL(k) grammar classes. These grammars cover only a subset of the context-free languages, whereas Wikitext requires global parser state and can therefore be considered a context-sensitive language. [6] As a result, they base their parser on a parsing expression grammar (PEG).
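The toy parser below (a hypothetical sketch, not taken from Sweble's grammar) illustrates the PEG-style ordered choice with backtracking that makes this approach workable: it first tries to recognise an internal link and, if that alternative fails, re-reads the same characters as plain text, something a fixed-lookahead LL(k) parser cannot do in general:

```java
// Toy PEG-style recogniser for a tiny Wikitext fragment; hypothetical sketch only.
public final class PegChoiceSketch {
    private final String input;
    private int pos;

    private PegChoiceSketch(String input) { this.input = input; }

    // inline <- link / text   (ordered choice: try link first, backtrack on failure)
    private String inline() {
        int mark = pos;              // remember the position for backtracking
        String link = link();
        if (link != null) return link;
        pos = mark;                  // backtrack and try the next alternative
        return text();
    }

    // link <- "[[" (!"]]" .)+ "]]"
    private String link() {
        if (!input.startsWith("[[", pos)) return null;
        int end = input.indexOf("]]", pos + 2);
        if (end < 0) return null;    // no closing brackets: this alternative fails
        String target = input.substring(pos + 2, end);
        pos = end + 2;
        return "Link(" + target + ")";
    }

    // text <- . (!"[[" .)*   (consume at least one character, then stop before "[[")
    private String text() {
        if (pos >= input.length()) return null;
        int start = pos;
        pos++;
        while (pos < input.length() && !input.startsWith("[[", pos)) pos++;
        return "Text(" + input.substring(start, pos) + ")";
    }

    public static void main(String[] args) {
        // "[[Main Page]]" is recognised as a link; "[[broken" backtracks to plain text.
        System.out.println(new PegChoiceSketch("[[Main Page]]").inline());
        System.out.println(new PegChoiceSketch("[[broken").inline());
    }
}
```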
Sweble parses the Wikitext and produces an abstract syntax tree (AST) as output. This helps to avoid errors caused by incorrect markup (e.g. a link spanning multiple cells of a table).
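A minimal sketch of how the engine might be invoked to obtain such a tree is shown below. The class and method names (DefaultConfigEnWp, WtEngineImpl, PageTitle, PageId, postprocess) follow the example application distributed with the Sweble sources and should be treated as assumptions, since the exact API may differ between versions:

```java
// Hedged sketch of invoking the Sweble engine; names follow the bundled example
// application and may differ between Sweble versions.
import org.sweble.wikitext.engine.PageId;
import org.sweble.wikitext.engine.PageTitle;
import org.sweble.wikitext.engine.WtEngineImpl;
import org.sweble.wikitext.engine.config.WikiConfig;
import org.sweble.wikitext.engine.nodes.EngProcessedPage;
import org.sweble.wikitext.engine.utils.DefaultConfigEnWp;

public final class ParseToAstSketch {
    public static void main(String[] args) throws Exception {
        String wikitext = "== Heading ==\nSome text with a [[Main Page|link]].";

        // Configuration describing an English-Wikipedia-like wiki.
        WikiConfig config = DefaultConfigEnWp.generate();
        WtEngineImpl engine = new WtEngineImpl(config);

        // Every page needs a title and an id before it can be processed.
        PageTitle title = PageTitle.make(config, "Example");
        PageId pageId = new PageId(title, -1);

        // Run the parser pipeline; the result wraps the abstract syntax tree.
        EngProcessedPage processed = engine.postprocess(pageId, wikitext, null);
        System.out.println(processed.getPage());
    }
}
```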
The parser processes Wikitext in five stages [6]: