This page is currently inactive and is retained for historical reference. Either the page is no longer relevant or consensus on its purpose has become unclear. To revive discussion, seek broader input via a forum such as the village pump.
Parse::MediaWikiDump is a Perl module created by Triddle that makes it easy to access the information in a MediaWiki dump file. Its successor, MediaWiki::DumpFile, is written by the same author and is also available on CPAN.
The latest versions of Parse::MediaWikiDump and MediaWiki::DumpFile are available at https://metacpan.org/pod/Parse::MediaWikiDump and https://metacpan.org/pod/MediaWiki::DumpFile
#!/usr/bin/perl -w
# Print the titles of uncategorized pages in the main namespace.
use strict;
use Parse::MediaWikiDump;
my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;
while (defined($page = $pages->next)) {
    # main namespace only
    next unless $page->namespace eq '';
    # categories is undefined when the page has no categories
    print $page->title, "\n" unless defined($page->categories);
}
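The following program lists double redirects in the main namespace, that is, redirects whose target is itself a redirect. It does not follow the proper case-sensitivity rules for matching article titles; see the documentation that comes with the module for a much more complete version of this program.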
#!/usr/bin/perl -w
use strict;
use Parse::MediaWikiDump;
my $file = shift or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my %redirs;    # redirect title => target title
while (defined(my $page = $pages->next)) {
    # main namespace only
    next unless $page->namespace eq '';
    # skip pages that are not redirects
    next unless defined($page->redirect);
    my $title = $page->title;
    $redirs{$title} = $page->redirect;
}
# a double redirect is a redirect whose target is also a key in %redirs
while (my ($key, $redirect) = each(%redirs)) {
    if (defined($redirs{$redirect})) {
        print "$key\n";
    }
}
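The case-sensitivity limitation noted above can be reduced by normalizing titles before comparing them. The following is a minimal sketch in plain Perl, not the module's own method: it assumes the wiki uses the "first-letter" setting (as Wikipedia does), where only the first character of a title is case-insensitive and underscores are equivalent to spaces. The names `normalize_title` and `double_redirects` are hypothetical helpers, and the sample hash stands in for the `%redirs` hash built by the script.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize a title under the assumed "first-letter" rule.
sub normalize_title {
    my ($title) = @_;
    $title =~ tr/_/ /;           # underscores and spaces are equivalent
    $title =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return ucfirst $title;       # only the first character ignores case
}

# Given a hash reference of redirect title => target title, return the sorted
# list of normalized titles whose target is itself a redirect.
sub double_redirects {
    my ($redirs) = @_;
    my %norm;
    for my $title (keys %$redirs) {
        $norm{ normalize_title($title) } = normalize_title($redirs->{$title});
    }
    return sort grep { exists $norm{ $norm{$_} } } keys %norm;
}

# Sample data standing in for %redirs from the script above.
my %sample = (
    'Foo'     => 'bar_baz',    # lower-case, underscored target
    'Bar baz' => 'Qux',
    'Quux'    => 'Final',
);
print "$_\n" for double_redirects(\%sample);
```

With the raw comparison used in the script, `Foo` would be missed because `bar_baz` and `Bar baz` are different hash keys; after normalization they compare equal.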
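#!/usr/bin/perl
# Select pages in a given category for insertion into a MySQL database.
use strict;
use warnings;
use Parse::MediaWikiDump;
use DBI;
use DBD::mysql;
my $server   = "localhost";
my $name     = "dbname";
my $user     = "admin";
my $password = "pass";
my $dsn      = "DBI:mysql:database=$name;host=$server;";
# RaiseError makes DBI die on failure instead of returning undef
my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });
my $source = 'pages_articles.xml';
my $pages  = Parse::MediaWikiDump::Pages->new($source);
print "Dump opened; scanning pages.\n";
while (defined(my $page = $pages->next)) {
    # categories is undefined for uncategorized pages, so skip those
    my $c = $page->categories or next;
    # Matches all categories with the string "Mathematics" anywhere in their text.
    # For an exact match, use {$_ eq "Mathematics"}
    if (grep { /Mathematics/ } @$c) {
        my $id    = $page->id;
        my $title = $page->title;
        my $text  = $page->text;    # reference to the page text; use $$text for the string
        #$dbh->do("insert ..."); # details of SQL depend on the database setup
        print "title '$title' id $id was inserted.\n";
    }
}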
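The choice between substring and exact matching mentioned in the script's comments can be isolated in a small helper. This is a minimal sketch in plain Perl, assuming the category list arrives as an array reference, or as undef when a page has no categories (the behavior the uncategorized-pages example relies on). The name `in_category` and its third argument are hypothetical and not part of Parse::MediaWikiDump.

```perl
use strict;
use warnings;

# Hypothetical helper: count the categories in $cats that match $wanted.
# $cats  - array reference of category names, or undef for no categories
# $exact - true for whole-name equality, false for substring matching
sub in_category {
    my ($cats, $wanted, $exact) = @_;
    return 0 unless defined $cats;    # uncategorized page: nothing can match
    my @hits;
    if ($exact) {
        @hits = grep { $_ eq $wanted } @$cats;
    }
    else {
        # index() does a literal substring search, so no regex escaping is needed
        @hits = grep { index($_, $wanted) >= 0 } @$cats;
    }
    return scalar @hits;
}

# Sample category list standing in for $page->categories.
my $cats      = [ 'Mathematics education', 'Algebra' ];
my $substring = in_category($cats, 'Mathematics', 0);
my $exact     = in_category($cats, 'Mathematics', 1);
printf "substring matches: %d, exact matches: %d\n", $substring, $exact;
```

Guarding against undef matters once `use strict` is in effect, because dereferencing an undefined value as an array is then a fatal error.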
This script lists articles that contain interwiki links to :de, :es, :it, :ja and :nl but not to :fr. It is useful for finding "popular" articles that are still missing a link to a specific wiki, and it may also hint at which articles should be translated first.
#!/usr/bin/perl -w
# Code : Dake
use strict;
use Parse::MediaWikiDump;
use utf8;
my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;
binmode STDOUT, ":utf8";
while (defined($page = $pages->next)) {
    # main namespace only
    next unless $page->namespace eq '';
    # text is a reference to the page text, hence $$text below
    my $text = $page->text;
    if (($$text =~ /\[\[de:/i) && ($$text =~ /\[\[es:/i) &&
        ($$text =~ /\[\[nl:/i) && ($$text =~ /\[\[ja:/i) &&
        ($$text =~ /\[\[it:/i) && !($$text =~ /\[\[fr:/i))
    {
        print $page->title, "\n";
    }
}
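The hard-coded pattern matches can be generalized to arbitrary lists of languages. The sketch below is one possible way to do that in plain Perl, assuming the same `[[xx:` link syntax the script matches. The name `has_interwikis` is a hypothetical helper, not part of the module, and it takes the page text as a plain string, so it would be called with `$$text` inside the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text links to every language in @$required
# and to none of the languages in @$excluded.
sub has_interwikis {
    my ($text, $required, $excluded) = @_;
    for my $lang (@$required) {
        # \Q...\E quotes the language code so it is matched literally
        return 0 unless $text =~ /\[\[\Q$lang\E:/i;
    }
    for my $lang (@$excluded) {
        return 0 if $text =~ /\[\[\Q$lang\E:/i;
    }
    return 1;
}

# Same language lists as the script above.
my @required = qw(de es it ja nl);
my @excluded = qw(fr);

# Sample wikitext standing in for $$text.
my $sample  = 'Text [[de:A]] [[es:B]] [[it:C]] [[ja:D]] [[nl:E]]';
my $with_fr = $sample . ' [[fr:F]]';

printf "without fr link: %s\n", has_interwikis($sample,  \@required, \@excluded) ? 'candidate' : 'skip';
printf "with fr link: %s\n",    has_interwikis($with_fr, \@required, \@excluded) ? 'candidate' : 'skip';
```

Keeping the language lists in arrays means the target wiki can be changed by editing one line instead of six patterns.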