From Wikipedia, the free encyclopedia

This Ruby program has two modes. It can run as a daemon or text processor (daemon mode is preferred, since it's more efficient).

In text-scanning mode, it interprets its command line (or standard input, if no arguments are given) as text possibly containing [[wikilinks]]. It preserves the original text and adds a plain-text hyperlink after each link (the http: address enclosed in <> angle brackets).
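As a rough illustration of text-scanning mode, the sketch below reuses the link pattern from the script but replaces the full canonicalizer with a much simpler stand-in. The names `LINK_RE`, `simple_url` and `annotate` are hypothetical, and the sketch assumes every link points at the English Wikipedia; the real script also handles wiki, language and namespace qualifiers.

```ruby
require 'cgi'

# Same pattern the script uses to find [[wikilinks]] in running text.
LINK_RE = /\[\[\s*[^\s\\][^\]]+\]\]/

# Hypothetical stand-in for the script's canonicalizer: English Wikipedia only.
def simple_url(title)
  page = title.strip.squeeze(' ').tr(' ', '_').sub(/\A./) { |ch| ch.upcase }
  # Escape the title, but keep the namespace colon readable.
  'http://en.wikipedia.org/wiki/' + CGI.escape(page).gsub('%3A', ':')
end

# Keep the original text and append <url> after each [[link]].
def annotate(text)
  text.gsub(LINK_RE) do |bracketed|
    inner = bracketed[2...-2].sub(/\|.*$/, '')   # drop any |display text
    "#{bracketed} <#{simple_url(inner)}>"
  end
end

puts annotate('[[Ashland (disambiguation)]] is an example of a ' \
              '[[Wikipedia:Disambiguation]] page.')
```

Text without any [[link]] passes through unchanged, which is what lets the script act as a filter on ordinary prose.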

In daemon mode, it receives HTTP requests like http://localhost:4242/mwlink?page=wiki-page-name and redirects to the appropriate Wikimedia page. It's convenient for scripts to just use that URL rather than constructing one themselves--all they have to do is URL-escape the text between [[ and ]].
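For a calling script, the only work is the URL-escaping step. A minimal sketch, assuming the daemon listens on the example address above; `mwlink_request_url` is a hypothetical helper, not part of the script:

```ruby
require 'cgi'

# Hypothetical helper: build the request URL for a running mwlink daemon.
# The host and port defaults are the example values used on this page.
def mwlink_request_url(link_text, host = 'localhost', port = 4242)
  "http://#{host}:#{port}/mwlink?page=#{CGI.escape(link_text)}"
end

puts mwlink_request_url('Ashland (disambiguation)')
```

The caller passes the text exactly as it appeared between [[ and ]]; qualifiers such as a namespace or language prefix are left for the daemon to interpret.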

   #!/usr/bin/ruby

   # This script is dual-licensed under the GPL version 2 or any later
   # version, at your option. See http://www.gnu.org/licenses/gpl.txt for more
   # details.

   =begin

   = NAME

   mwlink - Linkify mediawiki-style wikilinks in plain text

   = SYNOPSIS

      mwlink options text-to-wikilink
         --daemon[=port]     Run as HTTP daemon
         --encoding          Default character set encoding (utf-8)
         --default-wiki      Default wiki (wikipedia)
         --default-language  Default language (en)

   = DESCRIPTION

   In text-scanning mode (without the --daemon argument), the mwlink program scans
   its arguments (or its standard input, in the event of no arguments) for
   wikilinks of the form [[link]]. It expands such links into URLs and inserts
   them into the original text after the [[link]] in angle brackets ((({<})) and
   (({>}))). Options are provided for specifying a default wiki (the wiki to link
   to if no qualifier is given in the link) and a default language (the language
   to assume if no qualifier is given) as well as the character set encoding in
   use. The built-in defaults are ((*wikipedia*)), ((*en*)) and ((*utf-8*)),
   respectively.

   In daemon mode (now preferred), it receives HTTP requests of the form
   "http://.../mwlink?page=((*wikipedia page*))" (the ((*wikipedia page*)) name is
   what would appear within a [[wikilink]]). URL-escaping is required but no other
   processing, making it convenient to use from scripts.

   == Initialization File

   The names of namespaces vary between wikis (especially due to
   language). For example, "User:" in English is "Benutzer:" in German. You can
   specify lists of namespaces to use for particular languages in an
   initialization file (({~/.mwlinkrc})). Each entry is simply a line with the
   language, a colon, and a space-separated list of namespaces in that
   language. When interpreting links for that language (either because
   ((*--default-language*)) was specified or there is a language qualifier in
   the link), mwlink will recognize those names as namespaces. All the
   namespaces must appear on one line--line continuation is not supported.

   Lines introduced with (({#})) (pound sign) are comments and are ignored,
   along with blank lines.

   Here is an example configuration containing (only) some namespaces from the
   German Wikipedia. ((*Note*)): To be kind to the wiki when this script is
   uploaded, I have broken the line, but it ((*must not be broken*)) in the
   actual file if it is to work with mwlink.

      de: Spezial Spezial_diskussion Diskussion Benutzer Benutzer_diskussion
      Bild Bild_diskussion Einordnung Einordnung_diskussion Wikipedia
      Wikipedia_talk WP Hilf Hilf_diskussion

   = WARNINGS

   * The program (like mediawiki) assumes links are not broken across line
     boundaries.
   * The mechanism for providing an alternate list of namespaces only works
     per-language; other wikis could have different namespaces, too.
   * The list of wikis and their abbreviations is doubtlessly incomplete.
   * The initialization file mechanism is not that useful for a shared daemon.
   * In command-line mode, it's very difficult to process ASCII em-dashes (--)
     correctly and still honor command-line options. mwlink gets it wrong, and
     that's one reason daemon mode is preferred.

   = AUTHOR

   Demi @ Wikipedia - http://en.wikipedia.org/wiki/User:Demi

   =end

   require 'cgi'
   require 'iconv'
   require 'getoptlong'
   require 'webrick'
   include WEBrick

   $opt = {
      'default-wiki' => 'wikipedia',
      'default-language' => 'en',
      'encoding' => 'utf-8'
   }

   class String

      def initcap()
         new = self.dup
         # Okay, I consider it dumb that a string subscripted produces an
         # integer --Demi
         new[0] = new[0].chr.upcase
         return new
      end

      def initcap!()
         self[0] = self[0].chr.upcase
         return self
      end

   end

   class Canon

      def initialize()
         @ns = { }
         @ns_array = %w(Media Special Talk User User_talk Project Project_talk
            Image Image_talk MediaWiki MediaWiki_talk Template Template_talk Help
            Help_talk Category Category_talk Wikipedia Wikipedia_talk WP)
         @ns['default'] = { }
         @ns_array.each { |nspc| @ns['default'][nspc] = nspc }

         if File::readable?(ENV['HOME'] + '/.mwlinkrc')
            IO::foreach(ENV['HOME'] + '/.mwlinkrc') { |line|
               next if line =~ /^\s*\#/
               next if line =~ /^\s*$/
               line.chomp!
               if m = line.match(/^(\w+)\:(.*)$/)
                  lang    = m[1]
                  nslist  = m[2].split
                  @ns[lang] = { }
                  nslist.each { |nspc| @ns[lang][nspc] = nspc }
               end
            }
         end

         @wiki = {
            'Wiktionary' => 'wiktionary',
            'Wikt' => 'wiktionary',
            'W' => 'wikipedia',
            'M' => 'meta',
            'N' => 'news',
            'Q' => 'quote',
            'B' => 'books',
            'Meta' => 'meta',
            'Wikibooks' => 'books',
            'Commons' => 'commons',
            'Wikisource' => 'source'
         }

         @wikispec = {
            'wikipedia' => { 'domain' => 'wikipedia.org', 'lang' => 1 },
            'wiktionary' => { 'domain' => 'wiktionary.org', 'lang' => 1 },
            'meta' => { 'domain' => 'meta.wikimedia.org', 'lang' => 0 },
            'books' => { 'domain' => 'wikibooks.org', 'lang' => 1 },
            'commons' => { 'domain' => 'commons.wikimedia.org', 'lang' => 0 },
            'source' => { 'domain' => 'sources.wikimedia.org', 'lang' => 0 },
            'news' => { 'domain' => 'wikinews.org', 'lang' => 1 },
         }

         @cs = Iconv.new("iso-8859-1", $opt['encoding'])

      end

      #TODO The % part of the # section of the URL should become a dot.

      def urlencode(s)
         CGI::escape(s).gsub(/%3[Aa]/, ':').gsub(/%2[Ff]/, '/').gsub(/%23/, '#')
      end

      def canonword(word)
         s = word.strip.squeeze(' ').tr(' ', '_').initcap

         begin
            @cs.iconv(s)
         rescue Iconv::IllegalSequence
            s
         end
      end

      def parselink(link)
         l = {
            'namespace' => '',
            'language' => $opt['default-language'],
            'wiki' => $opt['default-wiki'],
            'title' => ''
         }
         terms = link.split(':')
         l['title'] = canonword(terms.pop)
         terms.each { |term|
            next if term.nil? or term.empty?

            t = canonword(term)

            if @ns[l['language']]
            then
               ns = @ns[l['language']]
            else
               ns = @ns['default']
            end

            if ns.key?(t)
               l['namespace'] = ns[t]
            elsif @wiki.key?(t)
               l['wiki'] = @wiki[t]
            else
               l['language'] = t.downcase
            end
         }

         l
      end

      def canonicalize(link)
         linkdesc = parselink(link.sub(/\|.*$/, ''))

         if @wikispec.key?(linkdesc['wiki'])
            ws = @wikispec[linkdesc['wiki']]
            host = ws['domain']
            if ws['lang'] != 0
               host = linkdesc['language'] + '.' + host
            end
         else
            host = linkdesc['wiki'] + '.' + 'wikimedia.org'
         end

         uri =
            if linkdesc['namespace'].length > 0
               linkdesc['namespace'] + ':' + linkdesc['title']
            else
               linkdesc['title']
            end

         r = urlencode('http://' + host + '/wiki/' + uri)
         r
      end

      def to_s()
         "Namespace sets: " + @ns.keys.join(', ') +
         "; Wikis: " + @wiki.to_a.join(', ')
      end
   end

   def linkexpand(c, bracketlink)
      linktext =
         if m = /\[\[([^\]]+)\]\]/.match(bracketlink)
            m[1]
         else
            bracketlink
         end

      bracketlink +
         " <" + c.canonicalize(linktext) + ">"
   end

   c = Canon.new()
   re = /\[\[\s*[^\s\\][^\]]+\]\]/

   class MwlinkServlet < HTTPServlet::AbstractServlet

      def initialize(server, canonicalizer)
         super(server)
         @c = canonicalizer
      end

      def do_GET(rq, rs)
         p = CGI.parse(rq.query_string)
         # Just for testing
         l = @c.canonicalize(p['page'][0])
         rs.status = 302
         rs['Location'] = l
         rs.body = "<html><body>\n" +
            "<a href=\"#{l}\">#{p['page'][0]}</a>\n" +
                     "</body></html>\n"
      end
   end

   begin
      GetoptLong::new(
         '--default-wiki',     GetoptLong::REQUIRED_ARGUMENT,
         '--default-language', GetoptLong::REQUIRED_ARGUMENT,
         '--encoding',         GetoptLong::REQUIRED_ARGUMENT,
         '--daemon',           GetoptLong::OPTIONAL_ARGUMENT
      ).each do |k, v|
         k = k.sub(/^--/,'')

         case k

         when 'default-wiki', 'default-language', 'encoding'
            $opt[k] = v

         when 'daemon'
            $opt['daemon'] = true
            if v.empty?
               $opt['port'] = 4242
            else
               $opt['port'] = v
            end
         end
      end
   rescue GetoptLong::InvalidOption
      true
   end

   if $opt['daemon']

      port = $opt['port'].to_i

      puts "Starting daemon on port #{port}"
      s = HTTPServer.new(:Port => port)
      s.mount("/mwlink", MwlinkServlet, c)

      trap('INT') { s.shutdown }

      s.start

   else

      # Note, there are various combinations of -- appearing in normal text that
      # will break this. --daemon is the recommended method.
      if ARGV.empty?
         STDIN.each_line { |line|
            puts line.chomp.gsub(re) { |expr| linkexpand(c, expr) }
         }
      else
         puts ARGV.join(' ').gsub(re) { |expr| linkexpand(c, expr) }
      end

   end
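The qualifier handling in parselink and canonicalize can be condensed into a short standalone sketch. The tables below are trimmed, hypothetical versions of the script's own lists, `canon` and `resolve` are illustrative names, and URL-escaping and character-set conversion are left out, so treat this as an outline of the idea rather than a replacement for the code above.

```ruby
# Trimmed, hypothetical tables; the script's own lists are longer.
NAMESPACES = {
  'default' => %w[User User_talk Wikipedia Help],
  'de'      => %w[Benutzer Benutzer_diskussion Wikipedia]
}
WIKIS = { 'Wikt' => 'wiktionary', 'W' => 'wikipedia', 'M' => 'meta' }
SITES = {
  'wikipedia'  => { domain: 'wikipedia.org',      per_language: true },
  'wiktionary' => { domain: 'wiktionary.org',     per_language: true },
  'meta'       => { domain: 'meta.wikimedia.org', per_language: false }
}

# Write a term the way a wiki title is written: underscores, leading capital.
def canon(term)
  term.strip.squeeze(' ').tr(' ', '_').sub(/\A./) { |ch| ch.upcase }
end

# Classify each colon-separated qualifier, then build the URL (unescaped).
def resolve(link, language = 'en', wiki = 'wikipedia')
  terms = link.sub(/\|.*$/, '').split(':')
  title = canon(terms.pop)
  namespace = nil
  terms.reject(&:empty?).each do |term|
    t = canon(term)
    # Namespace names depend on the language seen so far.
    known = NAMESPACES[language] || NAMESPACES['default']
    if known.include?(t)
      namespace = t
    elsif WIKIS.key?(t)
      wiki = WIKIS[t]
    else
      language = t.downcase   # anything unrecognised is taken as a language
    end
  end
  site = SITES[wiki]
  host =
    if site.nil?
      "#{wiki}.wikimedia.org"            # same fallback the script uses
    elsif site[:per_language]
      "#{language}.#{site[:domain]}"
    else
      site[:domain]
    end
  "http://#{host}/wiki/#{[namespace, title].compact.join(':')}"
end

puts resolve('de:Benutzer:Demi')
puts resolve('m:Help:Contents')
```

Order matters in this scheme: a language qualifier has to come before a namespace written in that language, because the namespace list is chosen from the language seen so far.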

Example output:

 [[Ashland (disambiguation)]] is an example of a
 [[Wikipedia:Disambiguation]] page.
 [[Ashland (disambiguation)]] <http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29> is an example of a
 [[Wikipedia:Disambiguation]] <http://en.wikipedia.org/wiki/Wikipedia:Disambiguation> page.
 GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29
 GET http://localhost:4242/mwlink?page=Ashland+%28disambiguation%29 --> 302 Found
 GET http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29 --> ...(page content)

The GET program is a utility distributed with Perl's libwww-perl (LWP). Also note that Wikimedia servers forbid scripts based on the LWP Perl module.
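The redirect step in the trace above can also be observed from Ruby with Net::HTTP, which does not follow redirects on its own. So that the sketch runs without mwlink or network access, it talks to a tiny stand-in server that answers one request with a fixed 302; `start_stub` and `redirect_target` are hypothetical names, and against a real daemon only `redirect_target` would be needed.

```ruby
require 'net/http'
require 'socket'
require 'uri'

# Stand-in for the daemon (an assumption for this sketch): serves a single
# request with a fixed 302 response and returns the port it listens on.
def start_stub(location)
  server = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS pick a free port
  port = server.addr[1]
  Thread.new do
    client = server.accept
    # Read the request head up to the blank line that ends it.
    loop do
      line = client.gets
      break if line.nil? || line == "\r\n"
    end
    client.write("HTTP/1.1 302 Found\r\nLocation: #{location}\r\n" \
                 "Content-Length: 0\r\nConnection: close\r\n\r\n")
    client.close
    server.close
  end
  port
end

# Ask the daemon where a page lives, without following the redirect.
def redirect_target(port, page)
  query = URI.encode_www_form_component(page)
  uri = URI("http://127.0.0.1:#{port}/mwlink?page=#{query}")
  response = Net::HTTP.get_response(uri)
  response.is_a?(Net::HTTPRedirection) ? response['location'] : nil
end

port = start_stub('http://en.wikipedia.org/wiki/Ashland_%28disambiguation%29')
puts redirect_target(port, 'Ashland (disambiguation)')
```

Reading the Location header this way gives a script the canonical URL without ever requesting the page from the Wikimedia servers.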
