
Static HTML dumps page always down!

http://static.wikipedia.org/
Hello, the Static HTML dumps download page never works. Is there a non-Wikipedian mirror where we can download the complete HTML version of en.wiki (and fr.wiki as well)? Or maybe a torrent? 13:16, 5 September 2007 (UTC)

Database backups of enwiki keep getting cancelled

The enwiki database dumps seem to either get cancelled or fail. Surely, considering the importance of the English Wikipedia, this is a critical problem; is anyone working on fixing it? It looks like there hasn't been a proper backup of enwiki for over 2 months. This needs escalating, but there are no obvious ways of doing so. -- Alun Liggins 19:51, 5 October 2007 (UTC)

I spoke to someone on the wikitech channel on IRC and they told me that the problem is on someone's ToDo list, but that it's a 'highly complicated' problem. I don't know what this means in terms of its prospects for getting done. Maybe we can find out who is responsible for moving forward with this? I wonder if there is something we can do to help in the meantime. Bestchai 04:44, 6 October 2007 (UTC)
Thanks Bestchai. Looking at the backups, there are other backups after the "every edit ever" backup that include just the current pages; surely it would be better to skip to those, so at least we'd have the last edit rather than nothing at all in the event of a disaster. Reading Wikipedia Weekly episode 31, it appears that the dumps just mysteriously fail after a while. Maybe these aren't the only backups they do, and they make tape/disk copies of the main data files too? Anyone know? If this is the only backup, then getting it fixed today should be the absolute number 1 priority for the Wikimedia Foundation. -- Alun Liggins 13:57, 6 October 2007 (UTC)
Complete absence of any feedback from the administrators; perhaps I've just not looked in the correct place, or they are avoiding the issue? -- Alun Liggins 17:40, 12 October 2007 (UTC)
I went on IRC and asked in #wikimedia-tech, here's a log of what I got: http://pastebin.ca/734865 and http://leuksman.com/log/2007/10/02/wiki-data-dumps/ 69.157.3.164 02:40, 13 October 2007 (UTC)
Database dumps now look totally broken/stalled. I've not been able to determine (anywhere) if this is the sole method of backing up the databases. -- Alun Liggins ( talk) 21:20, 13 December 2007 (UTC)
Now the enwiki ones have failed for pages-articles, and no one seems concerned or wants to talk about it. The two messages I posted to wikitech-l asking about the backups have been deleted. I would have thought that backups take precedence over other, fluffier things. -- Alun Liggins ( talk) 19:47, 21 December 2007 (UTC)
I've not been able to get a viable pages-meta-history.xml.7z file for at least 4 months. I think this is a reasonably serious issue. Is anyone out there maintaining an archive of successful full backups, or at least a list of the successful backups for enwiki? Is anyone actually looking at this problem? Aidepkiwi ( talk) 17:52, 10 January 2008 (UTC)

Where can I download an enwiki XML dump ?

Hello, could someone who knows about this please post the link to the exact file, pages-articles.xml.bz2, for me. The more recent the better. I did read the explanations, but when I go to http://download.wikimedia.org/enwiki/ I can't download anything useful. When I get to http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml I then click http://download.wikimedia.org/enwiki/20070908/enwiki-20070908-pages-articles.xml.bz2 but I get error 404. Are there any older versions I can download? Thanks in advance. Jackaranga 01:00, 20 October 2007 (UTC)

Ok, no error 404 anymore; problem solved. Jackaranga 13:40, 23 October 2007 (UTC)

a simple table!

Hi, I'm editing the article MS. Actually, it's more of a disambiguation page or a list. There is a little debate there about how we could format the list. I couldn't help but think of my good old database programming class, where we could pick and choose how we wanted the information to be displayed. I think it would be really handy to be able to make a table, sort it in various ways, and have it display on Wikipedia the way that the final user would like. For example, the article could sort the list by MS, mS, Ms, M.S., etc., or by category (medical, aviation, etc.), then alphabetically, and so on. I can't quite put 2 and 2 together on how SQL and regular articles (Wikipedia's database) could be made to work together. -- CyclePat ( talk) 03:28, 26 November 2007 (UTC)

Main namespace dump

Are there any dumps of only the main namespace? It would be simple to do on my own, but it would be time consuming and memory intensive, and it seems like this would be something useful for other users. My parser is getting bogged down on old archived Wikipedia namespace pages which aren't particularly useful for my purposes, so it would be nice to have only the actual articles. Thanks. Pkalmar ( talk) 01:58, 21 December 2007 (UTC)

Arrrrgh, I can't read/find current dumps of the file

I can't seem to be able to read the file on my computer. Any help? And where is the current dump? I accidentally downloaded an old one. ( TheFauxScholar ( talk) 03:27, 4 April 2008 (UTC))

Inability to look after the database dumps

You would think the Wikimedia Foundation, with all the funding it gets, would be able to actually deliver the part of the open-source license that dictates that the "source" (i.e. dumps of the database) actually be made available. Currently they constantly violate this, then shout and scream (as is the Wikipedia way) at people who ask why there are no recent dumps. Hopefully someone will make a fork and run it properly. Oh, but hang on, the Wikimedia Foundation "seem" to almost deliberately restrict access to the pictures.... so no fork of Wikipedia then....! —Preceding unsigned comment added by 77.96.111.181 ( talk) 18:33, 18 May 2008 (UTC)

Static HTML dump

Does anyone know why no static HTML dump is available? I also asked this question here. Bovlb ( talk) 23:11, 22 May 2008 (UTC)

Big silence. I see that people have also been asking on wikitech, but there's been no comment since 2008-03-06, when it was a week away. Is there somewhere else I should be asking? Bovlb ( talk) 23:21, 30 May 2008 (UTC)

Last enwiki dump deleted?

Where is the last working full-history dump?

http://www.mail-archive.com/dbpedia-discussion@lists.sourceforge.net/msg00135.html

http://download.wikimedia.org/enwiki/20080103/ —Preceding unsigned comment added by 93.80.82.194 ( talk) 19:07, 15 June 2008 (UTC)


where are mentions of the API?

Am I insane, or does this article completely ignore user options with the MediaWiki API? http://www.mediawiki.org/wiki/API Aaronbrick ( talk) 00:22, 8 October 2008 (UTC)

Hi there, if I download static HTML dumps how can I later organize them? I mean, how can I browse these articles in some GUI, or do I just need to browse them in Windows directories? —Preceding unsigned comment added by 62.80.224.244 ( talk) 09:54, 20 October 2008 (UTC)

Scripted access to Special:Export forbidden?

Hi all. I wanted to access a subset of pages for the purposes of doing word-count analysis on them. Maybe a couple of hundred all told. I have tried to access these pages in XML via Special:Export (i.e. http://en.wikipedia.org/wiki/Special:Export/PAGE_NAME) but received a 403 back. I looked at Wikipedia's robots.txt but cannot find any rule for Special:Export itself. Anyone know why I might be failing here? AFAIK my IP is not banned or anything like that - it's a normal server-to-server request. —Preceding unsigned comment added by Nrubdarb ( talkcontribs) 11:08, 9 November 2008 (UTC)
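
For anyone hitting the same 403: one possible cause is a missing or generic User-Agent header, which Wikimedia's servers tend to reject, rather than robots.txt or an IP block. The sketch below only illustrates that idea; the script name and contact address in the User-Agent string are made-up placeholders, not an official recommendation.

import requests

# Hypothetical, descriptive User-Agent; the contact address is a placeholder.
HEADERS = {"User-Agent": "WordcountResearch/0.1 (contact: someone@example.org)"}

def export_page(title):
    # Fetch the XML export of a single page via Special:Export.
    url = "https://en.wikipedia.org/wiki/Special:Export/" + title
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    print(export_page("American_Samoa")[:200])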

enwiki dumps failing?

24 January 2009: the newest available dump (with only the articles, the archive most mirror sites will probably want) for the English Wikipedia enwiki is enwiki-20081008-pages-articles.xml.bz2 from October 2008. From the outside it looks as if an in-progress dump of pages-meta-history.xml.bz2 blocks further exports. Is there a way to report this problem? —Preceding unsigned comment added by 92.75.194.40 ( talk) 17:22, 24 January 2009 (UTC)


There is an enwiki dump processing at the moment, but the previous one failed a number of jobs: [1]

In particular:

2007-04-02 14:05:27 failed Articles, templates, image descriptions, and primary meta-pages. This contains current versions of article content, and is the archive most mirror sites will probably want. pages-articles.xml.bz2

I can't seem to find any older dumps either...

67.183.26.86 17:56, 5 April 2007 (UTC)

I'm also trying to make a local mirror for use while traveling, but can't find a functioning older dump. Any word on when the next dump will be out?

Looks like, after the failures, they've done another one a day later here. That is 02-04-2007, whereas the other was 01-04-2007. The pages-articles.xml.bz2 is done: "2007-04-07 12:59:23 done Articles, templates, image descriptions, and primary meta-pages" link. Reedy Boy 15:27, 11 April 2007 (UTC)

I'm a little worried: I am a PhD student whose research is completely dependent on this data. With the last 2 dumps failing, and others being removed, there isn't a single complete dump of the English Wikipedia data left available for download - and there hasn't been a new dump for almost a month. Has something happened? -- 130.217.240.32 04:47, 27 April 2007 (UTC)

A full, 84.6GB (compressed) dump has finally completed [2] (PLEASE, whoever runs the dumps, keep this one around at least until a newer version is complete, such that there's always at least one good, full dump). Regarding research, it's been suggested to dump random subsets in the requests for dumps on the meta site, but nothing's happened regarding it as of yet AFAICT.-- 81.86.106.14 22:38, 8 May 2007 (UTC)

Could someone do a statistical analysis of article referencing?

Over at Wikipedia:Improving referencing efforts we're trying to improve article referencing. It would be helpful if we had statistics on the proportion of articles with references, average references per article, etc. Does anyone have a bot that could help with this? Thanks. - Peregrine Fisher ( talk) ( contribs) 21:49, 15 December 2008 (UTC)

Access to wikipedia search queries

Can we get access to the Wikipedia search queries? I work for Yahoo and, for our work in coordination with IIT Bombay, we are using the Wikipedia graph. Please let me know at manish_gupta12003 [at] yahoo.com if we can have access to a small subset of such queries (without any session info or user info). Thanks! —Preceding unsigned comment added by 216.145.54.7 ( talk) 07:36, 28 April 2009 (UTC)

Missing Articles

I'm finding articles missing from the dumps.

Let's take a look at the first article in the dump:

AmericanSamoa (id 6)
#REDIRECT [[American Samoa]]{{R from CamelCase}}

Looking for redirect (American Samoa) in the dump:

grep -n ">American Samoa" enwiki-20090610-pages-articles.xml
72616: <title>American Samoa/Military</title>
6006367: <title>American Samoa/Economy</title>
7298422: <title>American Samoa/People</title>
7849801: <title>American Samoa/Geography</title>
7849816: <title>American Samoa/Government</title>
7849831: <title>American Samoa/Communications</title>
7849846: <title>American Samoa/Transportation</title>
44729785: <title>American Samoa at the 2004 Summer Olympics</title>
46998496: <title>American Samoan</title>
47801378: <title>American Samoa national football team</title>
65053327: <title>American Samoa territory, United States</title>
69169386: <title>American Samoa at the 2000 Summer Olympics</title>
74281114: <title>American Samoa Community College</title>
75011596: <text xml:space="preserve">American Samoa<noinclude>
79006314: <title>American Samoa national rugby league team</title>
97301557: <title>American Samoa National Park</title>
108306703: <title>American Samoa Territory Constitution</title>
119786420: <title>American Samoa at the 1996 Summer Olympics</title>
120802909: <title>American Samoa Fono</title>
120802959: <title>American Samoa House of Representatives</title>
120803207: <title>American Samoa Senate</title>
124460055: <title>American Samoa's At-large congressional district</title>
129254992: <title>American Samoan territorial legislature</title>
138027379: <title>American Samoa Football Association</title>

Can anyone shed any light on this? -- Smremde ( talk) 12:01, 25 June 2009 (UTC)

For reference, I've listed below all articles from the toolserver database with titles beginning "American Samoa" that I'd expect to see in the dump. This does indeed show more titles (51 vs 25) - your grep has picked out the first 25 of the titles. Possible causes: is your dump file incomplete, perhaps due to a (4 GB?) file-size limit on your machine? If not, is grep on your system balking at the large file and giving up partway through? - TB ( talk) 21:28, 25 July 2009 (UTC)
mysql> select page_id, page_title from page where page_namespace = 0 and page_title LIKE 'American_Samoa%' ORDER by 1 ASC;
+----------+-------------------------------------------------------------+
| page_id  | page_title                                                  |
+----------+-------------------------------------------------------------+
|     1116 | American_Samoa/Military                                     |
|    57313 | American_Samoa/Economy                                      |
|    74035 | American_Samoa/People                                       |
|    82564 | American_Samoa/Geography                                    |
|    82565 | American_Samoa/Government                                   |
|    82566 | American_Samoa/Communications                               |
|    82567 | American_Samoa/Transportation                               |
|   924548 | American_Samoa_at_the_2004_Summer_Olympics                  |
|   997978 | American_Samoan                                             |
|  1024704 | American_Samoa_national_football_team                       |
|  1687185 | American_Samoa_territory,_United_States                     |
|  1843552 | American_Samoa_at_the_2000_Summer_Olympics                  |
|  2058267 | American_Samoa_Community_College                            |
|  2262625 | American_Samoa_national_rugby_league_team                   |
|  3168052 | American_Samoa_National_Park                                |
|  3754535 | American_Samoa_Territory_Constitution                       |
|  4442190 | American_Samoa_at_the_1996_Summer_Olympics                  |
|  4504211 | American_Samoa_Fono                                         |
|  4504216 | American_Samoa_House_of_Representatives                     |
|  4504225 | American_Samoa_Senate                                       |
|  4738483 | American_Samoa's_At-large_congressional_district            |
|  5072984 | American_Samoan_territorial_legislature                     |
|  5656610 | American_Samoa_Football_Association                         |
|  7499304 | American_Samoa_at_the_1988_Summer_Olympics                  |
|  7821424 | American_Samoa_women's_national_football_team               |
|  7855525 | American_Samoa_at_the_1994_Winter_Olympics                  |
|  7855530 | American_Samoa_at_the_1992_Summer_Olympics                  |
|  7873035 | American_Samoa_Territorial_Police                           |
|  8213414 | American_Samoa_Department_of_Education                      |
|  8233086 | American_Samoan_general_election,_2008                      |
|  9682159 | American_Samoa_Constitution                                 |
| 10158999 | American_Samoa_national_rugby_union_team                    |
| 11944957 | American_Samoan_general_election,_2004                      |
| 11944974 | American_Samoan_legislative_election,_2006                  |
| 11944988 | American_Samoan_general_election,_2006                      |
| 12095118 | American_Samoa_national_soccer_team                         |
| 12225196 | American_Samoa_national_football_team_results               |
| 12869869 | American_Samoa_at_the_2007_World_Championships_in_Athletics |
| 12999445 | American_Samoa_national_basketball_team                     |
| 14938467 | American_Samoa_at_the_Olympics                              |
| 15587944 | American_Samoa's_results_and_fixtures                       |
| 15611124 | American_Samoa_Democratic_caucuses,_2008                    |
| 15641610 | American_Samoa_Republican_caucuses,_2008                    |
| 15681703 | American_Samoa_Republican_caucuses                          |
| 15929695 | American_Samoa_Territory_Supreme_Court                      |
| 16089779 | American_Samoa_at_the_2008_Summer_Olympics                  |
| 19468499 | American_Samoa_Governor                                     |
| 19785703 | American_Samoa_Supreme_Court                                |
| 19850654 | American_Samoa_gubernatorial_election,_2008                 |
| 19913142 | American_Samoa_Power_Authority                              |
| 20611195 | American_Samoa                                              |
+----------+-------------------------------------------------------------+
51 rows in set (0.00 sec)
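
One way to rule out the tool and file-size issues suggested above is to search the compressed dump as a stream instead of grepping the multi-gigabyte decompressed file. A minimal Python sketch, using the same dump filename and title prefix as the grep above:

import bz2
import re

# Stream the compressed dump line by line; this avoids any 4 GB limits on the
# decompressed file and keeps memory use small.
pattern = re.compile(r"<title>American Samoa")
with bz2.open("enwiki-20090610-pages-articles.xml.bz2", "rt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        if pattern.search(line):
            print(lineno, line.strip())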

Abstracts not available

I was trying to download the articles' abstracts of the English Wikipedia and just realized that the files are only a few KB big and contain just a couple of abstracts. I checked older versions of the dump and dumps in other languages, and the situation is the same there. Does anyone know how to fix the dump of the articles' abstracts, or know any other way to get those abstracts? —Preceding unsigned comment added by 141.45.202.255 ( talk) 10:07, 15 September 2009 (UTC)

which dumps contain redirects?

Possibly a dumb question, but which enwiki dumps contain redirects (i.e. an entry including title & text)? It would seem my "enwiki-20090902-pages-articles" doesn't have them? Thanks Rjwilmsi 23:13, 20 January 2010 (UTC)

It was a dumb question, the "pages-articles" XML dump does contain them. By default AWB's database scanner does not search redirects, but this default can be switched. Rjwilmsi 15:36, 2 February 2010 (UTC)

OLD data

- I've installed the Wikimedia script

- downloaded and installed all tables with the exclusion of the OLD tables...

Wikimedia doesn't work! Do I absolutely need the old tables or not?

Thanks

Davide

—Preceding unsigned comment added by 80.116.41.176 ( talk) 19:27, 1 July 2005 (UTC)

Dump frequency and size

Recent mailing list posts [3] [4] [5] indicate developers have been busy with the MediaWiki 1.5 update recently, but that smaller, more specific dumps can now more easily be created. I'm hoping we will see an article-namespace-only dump. The problem with talk pages is that the storage space required for them will grow without bound, whereas we can actually try to reduce the storage space needed for articles - or at least slow its growth - by merging redundant articles, solving content problems, etc. Such a dump would also make analysis tools that look only at encyclopedic content (and there are a growing number of useful reports - see Wikipedia:Offline reports and Template:Active Wiki Fixup Projects) run faster and not take up ridiculously large amounts of hard drive space (which makes it more difficult to find computers capable of producing them).

User:Radiant! has been asking me about getting more frequent updates of category-related reports for a "Categorization Recent Changes Patrol" project. This is difficult to do without more frequent database dumps, and I'm sure there are a number of other reports that could benefit from being produced more frequently, or at least containing more up to date information. (And less human editor time would be wasted as a result, we hope.)

I'm actually surprised that the database dumps must be produced manually; I'm sure there's an engineering solution that could automate the process, reducing the burden on developers. I hope the developers will be able to deal with performance and other issues to be able to do this in the near future, so we can stop nagging them about database dump updates and get on with the work of fixing up the wiki. -- Beland 03:48, 8 July 2005 (UTC)

A small and fast XML dump file browser

Check WikiFilter.

It works with all current wiki project dump files in all languages. You do not need PHP or MySQL. All you need is a web server like Apache, and then you can view a wiki page through your web browser, either in normal HTML format or in the raw wikitext format.

Rebuilding an index data-base is also reasonably fast. For example, the 3-GB English Wikipedia takes about 10 minutes on a Pentium4. Wanchun ( Talk) 06:35, 20 September 2005 (UTC)

Subsets of Image Dumps

Is there a good, approved way to download a relatively small number of images (say ~4000)? The tar download is kinda slow for this purpose. It'd be faster to web crawl (even with the 1-second delay) than to deal with the tar. But that seems disfavoured.

-- BC Holmes 21:07, 12 June 2006 (UTC)

Downloading (and uploading) blocked IP addys

Hi, I recently created a wiki for sharing code snippets and I've started to get some spam bots adding nonsense or spam to the articles. Rather than block them one at a time, I was thinking it'd be easier to preemptively use Wikipedia's IP block list and upload it into my wiki.

So, my question is, how do I download the block list -- and -- how do I upload it to my wiki? Thanks! - 206.160.140.2 13:49, 29 June 2006 (UTC)

Kiwix

If you are looking for an offline copy of wikipedia, I refer you to kiwix and a howto. Wizzy 08:55, 27 April 2010 (UTC)

Dump from 20100130 vs 20100312 (enwiki).

Why is the enwiki dump completed in April 2010:

20100312 # 2010-04-16 08:46:23 done All pages with complete edit history (.7z) Everything is Ok * pages-meta-history.xml.7z 15.8 GB

two times smaller than the previous dump, completed in March 2010?

20100130 # 2010-03-26 00:57:06 done All pages with complete edit history (.7z) Everything is Ok * pages-meta-history.xml.7z 31.9 GB

-- Dc987 ( talk) 23:09, 29 April 2010 (UTC)

Read this section. It looks like all the 7z files are broken. The full dump is 280 GB. Rjwilmsi 07:05, 30 April 2010 (UTC)
Uh. I see. "Please note that more recent dumps (such as the 20100312 dump) are incomplete." -- Dc987 ( talk) 06:26, 1 May 2010 (UTC)
The current English Wikipedia dump is available in bz2 (280 GB) and 7z (30 GB). The 7z size is so low due to its higher compression ratio. emijrp ( talk) 11:38, 16 August 2010 (UTC)

Downloading templates

Is there somewhere where I can download some of the templates that are in use here on wikipedia?

—Preceding unsigned comment added by 74.135.40.211 ( talk)

You probably want the pages-articles.xml.bz2. It has Articles, templates, image descriptions, and primary meta-pages. For example Reedy Boy 14:55, 13 February 2007 (UTC)
The above link returns 404. Is there somewhere else I can download all the templates used on Wikipedia? -- Magick93 11:22, 1 August 2007 (UTC)


here Wowlookitsjoe 18:40, 25 August 2007 (UTC)

Judging by the file name, this is all Wikipedia articles, and not just the templates - is that correct Wowlookitsjoe? -- Magick93 15:38, 1 September 2007 (UTC)

I am looking to download Wiki templates as well. What else will I need to work with it? FrontPage? Dreamweaver? —Preceding unsigned comment added by 84.59.134.169 ( talk) 19:31, August 27, 2007 (UTC)

I am looking for a way to download just the templates too. I am trying to import articles from Wikipedia that use templates, and nested templates are preventing the articles from displaying correctly. Is there any way to get exclusively the templates? —Preceding unsigned comment added by Coderoyal ( talkcontribs) 03:20, 15 June 2008 (UTC)

Yes agreed, even just a way to download the basic templates that you need to make an infobox and a few other things to get you started would be great.-- Kukamunga13 ( talk) 22:44, 8 January 2009 (UTC)

Yeah, I was searching for exactly what is described here. Could anyone help? Cheers

http://download.wikimedia.org/enwiki/20090713/ 207.250.116.150 ( talk) 18:59, 24 July 2009 (UTC)

Found this discussion as well when trying to export templates - further googling revealed there is a method using Special:Export and specifying the articles you want to export; these can then be imported with Special:Import or the included XML import utilities. -- James.yale ( talk) 08:53, 4 September 2009 (UTC)
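
A rough sketch of the Special:Export route described just above, for anyone who wants to script it rather than use the web form. The parameter names follow the fields on the Special:Export form; treat this as an illustration, not an official tool, and note that the User-Agent contact address and the example template titles are placeholders.

import requests

HEADERS = {"User-Agent": "TemplateExportSketch/0.1 (contact: someone@example.org)"}

def export_with_templates(titles):
    # Ask Special:Export for the listed pages plus the templates they transclude.
    data = {
        "pages": "\n".join(titles),
        "templates": "1",   # include transcluded templates
        "curonly": "1",     # current revisions only
    }
    resp = requests.post("https://en.wikipedia.org/wiki/Special:Export",
                         data=data, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    xml = export_with_templates(["Template:Infobox person", "Template:Citation needed"])
    open("templates-export.xml", "w", encoding="utf-8").write(xml)

The resulting XML can then be loaded on the target wiki with Special:Import or the importDump.php maintenance script.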

I, too, would LOVE a way to download templates without having to wade through some truly disparate garbage that some people (thankfully not all) call "documentation." This is the biggest flaw in Wikipedia - crappy documentation that doesn't tell newbies how to find much of ANYthing. Liquidfractal ( talk) 01:26, 18 August 2010 (UTC)

Hi all. I'm a bit worried about all this raw data we are writing every day, aka Wikipedia and wikis. Due to the closing of GeoCities by Yahoo!, I learnt about Archive Team (it's a must see). I also knew about the Internet Archive. Webpages are weak, hard disks can be destroyed in accidents, etc.

The Wikimedia Foundation offers us dumps for all the projects, and I'm not sure how many people download them all. Some days ago I wrote a small script (I can share it, GPL) which downloads all the backups (only the 7z files, about ~100 GB; I think they are the most important). Are you interested in getting a copy of all the Wiki[mp]edia public data? I don't know about public mirrors of Wikipedia (although there are many for Linux ISOs). There are no public dumps of Nupedia or its mailing lists; they are only available through the Internet Archive. [6] [7]. Today, that is history.

I wrote a blog post about this. I think that this is very important: we are writing down all of human knowledge, and it needs to be saved on every continent, in every country around the globe. Please write your comments and suggestions. Regards. emijrp ( talk) 12:05, 16 August 2010 (UTC)

I have move it to User:Emijrp/Wikipedia Archive. emijrp ( talk) 08:41, 10 September 2010 (UTC)

freecache.org

Would adding http://freecache.org to the database dump files save bandwidth? Perhaps this could be experimented with for files < 1 GB & > 5 MB - Alterego

FreeCache has been deactivated. Edward ( talk) 23:06, 19 October 2010 (UTC)
    • Also, in relation to incremental updates, it seems that the only reason freecache wouldn't work is because old files aren't accessed often, so they aren't cached. If some method could be devised whereby everyone who needed incremental updates accessed the files within the same time period, perhaps via an automated client, you could better utilize the ISPs' bandwidth. I could be way off here. - Alterego

database size?

Huge article, lots of information, but not a single innuendo as to what the approximate size of the Wikipedia database is... Can anyone help me find that out?-- Procrastinating@ talk2me 13:13, 25 December 2006 (UTC)

The last en.wiki dump I downloaded, enwiki-20061130-pages-articles.xml, expands to 7.93 GB. Reedy Boy 14:39, 25 December 2006 (UTC)
I think the indexes need an additional 30GB or so. I'm not sure yet because for the latest dump rebuildall.php has not completed after several days. ( SEWilco 00:35, 22 January 2007 (UTC))
There's no information as to the size of the images, sound, video, etc. The only data I could find was this link from 2008, which said the images were 420 GB. This is probably all of the languages put together, since articles share images. But I agree, the article should include some information about the DATABASE SIZE, since no one is ever going to print off Wikipedia on paper. Brinerustle ( talk) 21:06, 22 October 2009 (UTC)

Wikimedia Commons images: 7M and growing

Some info about the growth rate of uploaded images every month. Sizes are in MB. The largest batch of images uploaded was in 2010-07: 207,946 images, +345 GB, 1.6 MB per image average. Regards. emijrp ( talk) 11:36, 16 August 2010 (UTC)

date	imagescount	totalsize	avgsizeperimage
2003-1	1	1.29717636	1.297176361084
2004-9	685	180.10230923	0.262923079163
2004-10	4171	607.82356930	0.145726101486
2004-11	3853	692.86077213	0.179823714543
2004-12	6413	1433.85031033	0.223584954050
2005-1	7754	1862.35064888	0.240179345999
2005-2	8500	2974.20539951	0.349906517590
2005-3	12635	3643.41866493	0.288359213687
2005-4	15358	5223.11017132	0.340090517731
2005-6	18633	6828.90809345	0.366495362714
2005-7	19968	8779.54927349	0.439680953200
2005-11	21977	9490.01213932	0.431815631766
2005-5	27811	10488.68217564	0.377141497092
2005-12	28540	10900.89676094	0.381951533320
2005-8	29787	12007.92203808	0.403126264413
2005-9	28484	13409.05960655	0.470757604499
2006-2	36208	14259.63428211	0.393825515966
2006-1	35650	14718.57859612	0.412863354730
2005-10	31436	16389.36539268	0.521356578212
2006-3	41763	18516.05062675	0.443360166338
2006-4	46645	21974.19114399	0.471094246843
2006-5	49399	28121.86408234	0.569280027578
2006-6	50893	28473.23626709	0.559472545676
2006-12	52577	29190.01748085	0.555186060080
2006-7	58384	30763.06628609	0.526909192349
2006-10	59611	32019.11058044	0.537134263482
2006-11	57070	32646.13846588	0.572036770035
2006-9	64991	35881.70751190	0.552102714405
2007-2	64787	37625.21335030	0.580752517485
2007-1	71989	38771.36503792	0.538573463139
2006-8	78263	48580.74944115	0.620737122793
2007-3	97556	51218.63132572	0.525017746994
2007-6	92656	60563.04164696	0.653633241743
2007-4	116127	69539.80562592	0.598825472336
2007-5	94416	69565.31412792	0.736795819860
2007-12	96256	72016.71782589	0.748179000020
2007-7	110470	72699.43968487	0.658092148863
2008-2	103295	73830.05222321	0.714749525371
2007-11	118250	78178.20839214	0.661126498031
2008-1	114507	80664.45367908	0.704449978421
2008-3	115732	84991.75799370	0.734384249764
2007-10	112011	85709.30096245	0.765186463494
2007-8	120324	87061.07968998	0.723555397842
2008-4	102342	93631.80365562	0.914891282715
2007-9	125487	95482.06631756	0.760892094939
2008-6	101048	95703.97241211	0.947113969718
2008-11	120943	110902.88121986	0.916984705356
2008-5	116491	112381.90718269	0.964726091996
2008-9	134362	114676.82158184	0.853491475133
2008-10	114676	116692.37883282	1.017583267927
2009-2	127162	119990.37194729	0.943602427984
2008-12	193181	120587.25548649	0.624219025093
2009-1	126947	121332.83420753	0.955775514250
2008-7	119541	122324.61021996	1.023285820095
2008-8	119471	124409.18394279	1.041333745786
2010-8	89776	164473.27284527	1.832040554773
2009-9	132803	176394.38845158	1.328240991932
2010-2	158509	182738.31155205	1.152857639327
2009-5	143389	182773.11001110	1.274666187860
2009-6	160070	186919.01145649	1.167732938442
2009-11	178118	188819.04353809	1.060078394874
2009-4	196699	202346.40691471	1.028710908112
2010-3	178439	206399.28073311	1.156693776210
2010-1	361253	216406.85650921	0.599045147055
2009-12	143797	217440.78340054	1.512137133602
2009-7	164040	230185.27826881	1.403226519561
2009-8	167079	250747.52404118	1.500772233741
2009-3	209453	260923.17462921	1.245736153835
2010-6	173658	270246.74141026	1.556200931775
2010-4	208518	297715.36278248	1.427768167652
2010-5	203093	297775.34260082	1.466201900611
2009-10	272818	329581.74736500	1.208064524207
2010-7	207946	345394.14518356	1.660979990880
This is about 6.96 TB in total. Rich  Farmbrough, 15:43, 17 November 2010 (UTC).

Another offline Wikipedia idea...But would this be acceptable (re)use of Wikipedia's data?

Hi everyone! :-)

After reading the Download Wikipedia pages earlier today, a few ideas have popped into my mind for a compacted, stripped down version of Wikipedia that would (Hopefully) end up being small enough to fit on standard DVD-5 media, and might be a useful resource for communities and regions where Internet access is either unavailable or prohibitively expensive. The idea that I've currently got in mind is primarily based on the idea of storing everything in 6-bit encoding (Then BZIPped) which - Though leading to a loss of anything other than very basic formatting and line/paragraph breaks - Would still retain the text of the articles themselves, which is the most important part of Wikipedia's content! :-)

Anyhow, I'm prone to getting ideas in mind, and then never acting upon them or getting sidetracked by other things...So unless I ever get all of the necessary bits 'n' pieces already coded and ready for testing, I'm not even going to consider drinking 7.00GB of Wikipedia's bandwidth on a mere personal whim. That said - Before I give more thought to the ideas in mind - I just wanted to check a few things with more knowledgeable users:

  • I believe that all text content on Wikipedia is freely available/downloadable under the GFDL (including everything in pages-articles.xml.bz2). Is this correct, or are there copyrighted text passages that I might have to strip out in any such derivative work?
  • Given that my idea centres around preserving only the article texts in their original form - with formatting, TeX notation, links to external images/pages etc. removed where felt necessary - would such use still fall under the scope of the GFDL, or would I be regarded as having created a plagiarised work using Wikipedia's data?
  • Because of the differing sizes of assorted media - and because, if I actually got around to doing something with this idea, I'd rather create something that could be scaled up or down to fit the target media - someone could create a smaller offline Wikipedia to fit on a 2 GB pen-drive using a DVD version (~4.37 GB) as the source if they so needed. Because such smaller versions would be made simply and quickly by just truncating the index and article files as appropriate, it'd help to have the most useful/popular articles sorted to the start of the article file. Therefore: are any hit counts (preferably counting only human visits - not automated ones) kept against article pages, and - if so - can these hit counts be downloaded separately, without having to download the huge ~36 GB wiki dump?
  • Finally: for obvious reasons, I wouldn't want to try downloading the whole 6.30 GB compressed articles archive until I was certain that I'd created a workable conversion process and storage format that warranted a full-scale test... but having the top 5-10 MB of the compressed dump would be useful for earlier testing and bug blasting. Would using Wget to do this on an occasional basis cause any problems or headaches?

Farewell for now, and many thanks in advance for any advice! >:-)

+++ DieselDragon +++ ( Talk) - 22 October 2010 CE = 23:44, 22 October 2010 (UTC)

I don't think a bzipped 6-bit-encoded image would be any smaller than bzipped 8-bit-encoded. Compression is about information theory, and the information in both versions is the same. Wizzy 07:39, 23 October 2010 (UTC)
I was thinking of using 6-bit encoding with a constrained alphabet/character table, and possibly a database for common words, to preserve the encyclopaedic text on its own (the element that I consider the most important) whilst disposing of less crucial elements such as images, wiki-markup, XML wrappers, and possibly links. This would serve the purpose of stripping out everything bar the core content itself, which should (in theory) reduce the size of the encyclopedia to manageable everyday proportions whilst still keeping all of the essential elements. :-)
A simple analogy would be taking a newspaper - removing all images, adverts, visual formatting and other unnecessary elements - and keeping just the text on its own... resulting in a document that's much smaller (in data terms) than the original. Using the encoding that I have in mind, the text string "Wikipedia", which normally occupies 72 bits on its own, would be reduced to 54 bits. That said, my understanding of compression processes has never reached past the fundamental stage ( LZW gives me enough of a headache!) so I don't know if 6-bit would compress as well as its 8-bit equivalent. :-)
+++ DieselDragon +++ ( Talk) - 26 October 2010 CE = 14:10, 26 October 2010 (UTC)
I agree with Wizzy that using some custom encoding won't help you much, if at all. Using some other compression algorithm ( 7zip?) would probably give you better results. Also, downloading the dump for yourself, even just to try something out isn't wasting Wikimedia bandwidth – those files are published so that they can be downloaded. Using wget the way you describe should be okay too.
Wikipedia switched licenses, so it's now licensed under CC-BY-SA. That means that you can do pretty much anything with the texts (including changing them in any way) as long as you give proper credit to the authors and release your modifications under the same license.
Svick ( talk) 17:25, 28 October 2010 (UTC)
Have you looked at Wikipedia:Version_1.0_Editorial_Team? Smallman12q ( talk) 22:20, 28 October 2010 (UTC)
  • Text only, articles, redirects, templates and categories would fit on a DVD I trow. If not then the place to look for saving is pulling out HTML comments, spurious spaces, persondata, external links (since we are positing no Internet access), interwiki links, maybe infoboxes, nav-boxes and so on.
  • Best compression is apparently arithmetic encoding, and certainly the 7z/b7 compression seems far more efficient.
  • Hit data is available, it is 50M per hour compressed. I have been thinking about consolidating this, for other reasons. Let me know if you decide to go ahead with this and I will look harder at doing that.

Regards, Rich  Farmbrough, 10:47, 17 November 2010 (UTC).
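
The compression point above is easy to test for yourself. The toy script below packs a sample text into 6-bit codes (assuming a constrained alphabet of at most 64 distinct byte values, as proposed) and compares the bzip2-compressed sizes; on typical text the packed version compresses no better, and often slightly worse, since bzip2 already removes the redundancy that the packing removes. The sample filename is a placeholder.

import bz2

text = open("sample_article.txt", "rb").read()   # any chunk of plain article text

# Toy code table: assumes the sample uses at most 64 distinct byte values.
alphabet = {b: i for i, b in enumerate(sorted(set(text)))}
assert len(alphabet) <= 64, "toy packer assumes a constrained 64-symbol alphabet"

def pack6(data):
    # Repack each byte into 6 bits, emitting full bytes as they fill up.
    bits = nbits = 0
    out = bytearray()
    for b in data:
        bits = (bits << 6) | alphabet[b]
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
            bits &= (1 << nbits) - 1
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)   # flush the last partial byte
    return bytes(out)

print("bzip2 of raw text:      ", len(bz2.compress(text)))
print("bzip2 of 6-bit packing: ", len(bz2.compress(pack6(text))))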

What is pages-meta-current.xml.bz2

I note that it isn't currently possible to download a data dump due to server maintenance issues; however, I did find a dump of a file called enwiki-20091017-pages-meta-current.xml.bz2 on BitTorrent via The Pirate Bay. Is this the same data as pages-articles.xml.bz2?

One further query: has the Wikimedia Foundation considered making its data dumps available via BitTorrent? If you did, it might remove some load from your servers. -- Cabalamat ( talk) 01:40, 14 December 2010 (UTC)

The download server is up now. As far as I know, using BitTorrent wasn't considered, presumably because the load from the downloads is not a problem (but there is nothing stopping you or anyone else from creating those torrents). As for your question, the difference between pages-meta-current and pages-articles is that the former contains all pages, while the latter doesn't contain talk pages and user pages. Svick ( talk) 01:34, 20 December 2010 (UTC)

Where is the Dump of this Wikia?

Hi all, a couple of days ago I posted my problem on the LyricsWiki Help Desk page, but they didn't answer anything! I thought there was a relationship between Wikia projects & Wikipedia, so I decided to ask it here. This is the problem I wrote there:

"Hi, I'm lookin' for this wikia's dump, like any other wikia I went to "Statistics" page, and the Dump file (Which were in the bottom of the page in other Wikias) was not there!! Why there is not any dumps? Isn't it free in all Wiki projects? "

So, there should be a dump of LyricsWiki in [8], but there isn't! Why? Isn't it a free download? -- MehranVB talk | mail 16:14, 21 December 2010 (UTC)

Wikia and Wikipedia aren't really related, so this is a bad place to ask. I'd suggest asking at Wikia's Community Central Forum. Svick ( talk) 22:54, 25 December 2010 (UTC)
Thanks! -- MehranVB talk | mail 06:35, 26 December 2010 (UTC)

split dump into 26 parts (A to Z)

It has been done for several hundred years to manage the huge size of encyclopedias. It would be more manageable for users to download. And maybe it would help in creating the dump images too. Paragraph 2 states "The dump process has had difficulty scaling to cope with the size of the very largest wikis". Can you provide more info on what kind of difficulties were encountered? Why can't "traditional" splitting into A - Z help the dump process? 00:56, 3 February 2011 (UTC) — Preceding unsigned comment added by Luky83 ( talkcontribs)

AFAIK the main problem was that creating the dumps took a very long time, which should hopefully be solved now, and your proposal wouldn't help with that. Also, there are several reasons why it would be a bad idea:
  1. People would have to download 26 files instead of one. That's needlessly tedious.
  2. The dumps aren't read by people (at least not directly), but by programs. There is not much advantage for programs in this division.
  3. Some articles don't start with one of the 26 letters of the English alphabet, but that could be easily solved by a 27th part.
Svick ( talk) 13:39, 10 February 2011 (UTC)

Full English Dumps Failing

This article explains that the full dumps of the English-language Wikipedia have been failing, but doesn't indicate whether or not the problem is going to be resolved. Can we expect to see a full extract any time soon, or even for this to become routine again? — Preceding unsigned comment added by Jdownie ( talkcontribs) 11:56, 22 April 2011 (UTC)

Copyright infringements

Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information.

Does anyone else feel that, even though it says 'most' and it's probably true that copyright infringements are more common among images, the statement is potentially confusing, as it seems to suggest copyright infringements are something unique to images? I was thinking of something like 'or believed'. And perhaps at the end: 'Remember some content may be copyright infringements (which should be deleted).' Nil Einne ( talk) 14:39, 16 January 2011 (UTC)
Is there any way to easily download all of these images? I could write a quick script to look at the tags on each File: page and filter out the non-free images.-- RaptorHunter ( talk) 22:53, 24 April 2011 (UTC)

May 2011 Database Dump.

I have created a new torrent for the May 2011 database dump. This torrent resolves some of the problems with the old torrent. It's also a gigabyte smaller (I cut out all of the Wikipedia: namespace pages) -- RaptorHunter ( talk) 01:57, 29 May 2011 (UTC)

http://thepiratebay.org/torrent/6430796

Why No Wikipedia Torrents?

I noticed everyone whining about downloads taking up precious Wikipedia bandwidth, which makes me wonder: why aren't the most popular and largest downloads available as a torrent? Project Gutenberg has the option to download ebook CDs and DVDs via torrent. • SbmeirowTalk • 21:39, 21 December 2010 (UTC)

The dumps change quickly; they are generated every month or so, so when you are downloading a dump, it gets out of date very soon. By the way, you have some unofficial wiki torrents here: http://thepiratebay.org/search/wikipedia/0/99/0 This is being discussed here too. emijrp ( talk) 13:56, 23 December 2010 (UTC)
So what if the dumps change quickly? Just make publishing a new .torrent file part of the procedures for posting the bz2 files. It's not rocket science.
I suggest you start a dialog with some of the linux distributions (debian, fedora, et al) - they manage to distribute their ISOs via torrents with great success. I believe the linux ISOs are of comparable size and frequency to wikipedia dumps.
As for unofficial torrents going up on thepiratebay - it doesn't matter how wikipedia distribute the dumps there will always be someone who attempts to publish their own torrents. Do what the linux distribution people do - ignore them.
Currently I'm getting a download rate of 11Kb/sec on these dumps. If you made them available as torrents you would have next to no bandwidth problems. —Preceding unsigned comment added by 203.214.66.249 ( talk) 23:46, 16 February 2011 (UTC)
I agree 100% with the above comment. I'm sure it wouldn't be difficult to find people to get involved as part of the torrent swarm. • SbmeirowTalk • 02:21, 17 February 2011 (UTC)
I haven't heard anyone from the WMF complaining about dump bandwidth, so I think this is a solution to a problem that doesn't exist. Svick ( talk) 16:52, 19 February 2011 (UTC)
Bandwidth isn't my issue; the time it takes ME to download through the Wikipedia TINY STRAW is the issue! • SbmeirowTalk • 21:32, 19 February 2011 (UTC)
Ah, sorry I misread. That's odd, when I tried downloading a dump just now, the speed was between 500 KB/s and 1 MB/s most of the time. And I didn't have problems with slow downloads in the past either. Svick ( talk) 21:59, 19 February 2011 (UTC)
Another advantage to torrents is being able to download the dump on an unreliable network. — Preceding unsigned comment added by 61.128.234.243 ( talk) 03:00, 10 June 2011 (UTC)

Dump

Is there a dump of Wikipedia articles without the brackets? Clean text without the bracketed links and references like [[ ]] and <references>? Clean and plain text. I'm just not sure if I can find this information in the article. Thanks. 71.33.206.156 ( talk) 10:24, 5 March 2011 (UTC)

It should be easy to run something like sed or a python script on the database dump to strip all of that formatting out. — Preceding unsigned comment added by RaptorHunter ( talkcontribs) 02:47, 11 March 2011 (UTC)
Use this tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor — Preceding unsigned comment added by 61.128.234.243 ( talk) 03:42, 10 June 2011 (UTC)
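
For completeness, here is a very rough illustration of the "sed or a python script" idea above. Real wikitext needs a proper parser (the Wikipedia Extractor linked above, or similar), because templates and refs nest, but a handful of regexes already give readable plain-ish text:

import re
import sys

def strip_wikitext(text):
    text = re.sub(r"<ref[^>]*/>", "", text)                        # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # <ref>...</ref>
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # simple, un-nested templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                              # bold/italic markup
    return text

if __name__ == "__main__":
    # Usage: python strip_wikitext.py < article.wikitext > article.txt
    sys.stdout.write(strip_wikitext(sys.stdin.read()))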

June 2011 Database Dump much smaller than May 2011 Database Dump

It appears that the http://download.wikimedia.org/enwiki/20110620/enwiki-20110620-pages-articles.xml.bz2 file is 5.8 GB, whereas http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-articles.xml.bz2 is 6.8 GB. The big difference between these two files suggests a problem in the dump process. Any idea about what might have happened here? Thanks! — Preceding unsigned comment added by 65.119.214.2 ( talk) 20:28, 7 July 2011 (UTC)

See a message on the xmldatadumps-l mailing list. User<Svick>. Talk() ; 17:38, 8 July 2011 (UTC)

History and logs

Is there any chance of getting just the history pages and logs for the wiki? I am interested in researching editing patterns and behaviors and I don't need any article text (especially not every version of article text). Or is there a better place to get this information? — Bility ( talk) 20:06, 15 August 2011 (UTC)

Yes, go to http://dumps.wikimedia.org/enwiki/latest/ and download the pages-logging.xml.gz and stub-meta-history.xml.gz files only. emijrp ( talk) 22:39, 15 August 2011 (UTC)
Excellent, thank you! — Bility ( talk) 23:57, 15 August 2011 (UTC)
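
As an illustration of what can be done with stub-meta-history (revision metadata without any article text), the sketch below counts revisions per registered contributor; anonymous edits, which are recorded as <ip> rather than <username>, are simply skipped. The "latest" filename is a placeholder and may need changing to a dated dump.

import gzip
from collections import Counter
from xml.etree import ElementTree as ET

edits = Counter()
with gzip.open("enwiki-latest-stub-meta-history.xml.gz", "rb") as f:
    context = ET.iterparse(f, events=("start", "end"))
    _, root = next(context)                      # grab the root element
    for event, elem in context:
        tag = elem.tag.rsplit("}", 1)[-1]        # strip the XML namespace
        if event == "end" and tag == "username":
            if elem.text:
                edits[elem.text] += 1
        if event == "end" and tag == "page":
            root.clear()                         # free finished pages to bound memory

for user, count in edits.most_common(20):
    print(count, user)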

Database Torrent

I have created a Wikipedia Torrent so that other people can download the dump without wasting Wikipedia's bandwidth.

I will seed this for a while, but I really am hoping that Wikipedia could publish an official torrent itself every year containing validated articles (checked for vandalism and quality), and then seed that torrent indefinitely. This will reduce strain on the servers and make the process of obtaining a database dump much simpler and easier. It would also serve as a snapshot in time, so that users could browse the 2010 or 2011 Wikipedia.

Torrent Link — Preceding unsigned comment added by RaptorHunter ( talkcontribs) 02:58, 11 March 2011 (UTC)

At this exact moment, it looks like there are 15 seeders with 100% of the file available to share via torrent at http://thepiratebay.org/torrent/6234361 • SbmeirowTalk • 06:19, 25 April 2011 (UTC)

Wikipedia Namespace

I wrote a script to cut out all of the Wikipedia namespace from the database dump. This includes stuff like WP:AFD, WP:Village Pump, and hundreds of pages of archives that most people don't want. Removing it saves about 1.2 GiB, or 19%. I also cut out the File:, MediaWiki:, and Help: namespaces.

Tell me if you think those namespaces should be cut from my next torrent.-- RaptorHunter ( talk) 22:47, 24 April 2011 (UTC)

It would make sense to remove stuff that isn't important so as to shrink the size, but unfortunately I don't know what should or shouldn't be in the database. I highly recommend that you contact all vendors / people that have written Wikipedia apps to ask for their input. • SbmeirowTalk • 06:11, 25 April 2011 (UTC)
Is there any way you can re-script this to remove every article except those in the Template namespace, such as Template:Infobox user or Template:War? I ask because I've been looking to get hold of just the Template namespace, but everything I've seen requires several arbitrary steps that are both confusing and not documented well at all. -- RowenStipe ( talk) 10:43, 19 September 2011 (UTC)
I have posted the script I use on pastebin. http://pastebin.com/7sg8eLeX
Just edit the bad_titles= line to cut whatever namespaces you want. The usage comment at the top of the script explains how to run it in linux. — Preceding unsigned comment added by 71.194.190.179 ( talk) 22:22, 29 December 2011 (UTC)
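
The pastebin script itself is not reproduced here, but the idea is simple enough to sketch: stream the dump page by page and keep only pages whose title starts with the namespace prefixes you want (for the Template-only case above, just "Template:"). This assumes the usual one-XML-element-per-line layout of the dumps; the filenames are placeholders.

import bz2
import re
import sys

WANTED = ("Template:",)   # title prefixes to keep

def filter_dump(in_path, out):
    page, keep = [], False
    with bz2.open(in_path, "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                page, keep = [line], False
            elif page:
                page.append(line)
                if "<title>" in line:
                    title = re.search(r"<title>(.*?)</title>", line).group(1)
                    keep = title.startswith(WANTED)
                if "</page>" in line:
                    if keep:
                        out.writelines(page)
                    page = []
            else:
                out.write(line)   # dump header (<mediawiki>, <siteinfo>) and footer

if __name__ == "__main__":
    filter_dump("enwiki-latest-pages-articles.xml.bz2", sys.stdout)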

Text Viewer

I have the latest dump of Wikipedia articles as an .xml file, in the hope of sorting it, using only articles on a few topics, and then uploading them to a MySQL database as a supplement to an online dictionary.

Which is the best text editor program people use for the 33 GB XML text file (UltraEdit / VIM)? I only have 8 GB of RAM on this computer - not sure if it will just load whatever part I'm currently viewing, use a scratch disk, or just crash.

Also, the filename is enwiki-latest-pages-articles.xml.bz2.xml.bz2.xml - I used 7-Zip to expand the first level, but what would I use to uncompress next? Isn't XML already uncompressed?

Thanks to anyone who is familiar with this. -- Adam00 ( talk) 16:37, 13 October 2011 (UTC)

It's not very rich in features (it can run macros), but TextPad just opens the files as if they were a 1 KB txt. When you scroll, it loads that part of the file. If you scroll a lot it will fill up the RAM. 84.106.26.81 ( talk) 23:45, 1 January 2012 (UTC)
I used 'less' (as in the Linux command). It opens the 34 GB file instantly, and never uses more than a hundred KB of RAM even though I have been scrolling and jumping around the buffer for a couple of hours now. Jimreynold2nd ( talk) 05:32, 29 February 2012 (UTC)

Multistream dump index is broken

The multistream dump used a signed 32-bit variable as the offset, and as such, offsets of articles past 2 GB overflowed and are negative numbers. Did anybody notice this? Jimreynold2nd ( talk) 05:32, 29 February 2012 (UTC)

I have just noticed that on a 20120104 enwiki dump. Has this already been fixed in newer dumps? 177.28.90.160 ( talk) 22:30, 30 April 2012 (UTC)
Well, here is a small program to fix the offsets, as a temporary solution: http://inf.ufrgs.br/~vbuaraujo/foo/c/fix_enwiki_index_offset.c
It reads the uncompressed index from stdin, and prints the fixed index to stdout. 177.28.90.160 ( talk) 22:57, 30 April 2012 (UTC)
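
For reference, the same fix is only a few lines in Python as well, assuming the usual "offset:pageid:title" format of the multistream index: offsets that overflowed a signed 32-bit integer show up negative, so adding 2^32 restores them.

import sys

# Usage: bzcat enwiki-20120104-pages-articles-multistream-index.txt.bz2 | python fix_index.py > index-fixed.txt
for line in sys.stdin:
    offset, rest = line.split(":", 1)
    offset = int(offset)
    if offset < 0:
        offset += 2 ** 32        # undo the signed 32-bit overflow
    sys.stdout.write("%d:%s" % (offset, rest))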

the name of the software used to split the wikipedia dump

Hello, I use BzReader to read Wikipedia offline. The problem is that the size of the English dumps is very big (7.5 GB), and BzReader doesn't succeed in creating the index. On the site we can download the dumps in parts, but as there are many parts (28), the search becomes very slow. I'm looking for the name of the software they use to split the dump, so I can split it into just two parts. Thank you, and sorry for my English. Rabah201130 ( talk) 10:52, 21 May 2012 (UTC)
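
I don't know which tool the dump process itself uses, but splitting a decompressed pages-articles XML file in two at a <page> boundary is not hard to script. The sketch below copies the dump header to both halves and closes the first half with </mediawiki>, so each part is still a well-formed dump; the split point is approximate (character counts, not bytes) and the filenames are placeholders. Each half would then need recompressing with bzip2 before BzReader can index it.

import os

def split_dump(in_path, out1_path, out2_path):
    half = os.path.getsize(in_path) // 2
    header, in_header = [], True
    seen = 0
    out1 = open(out1_path, "w", encoding="utf-8")
    out2 = open(out2_path, "w", encoding="utf-8")
    current = out1
    with open(in_path, "r", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                in_header = False
                if current is out1 and seen >= half:
                    out1.write("</mediawiki>\n")   # close the first half
                    out2.writelines(header)        # repeat the dump header
                    current = out2
            if in_header:
                header.append(line)
            current.write(line)
            seen += len(line)
    out1.close()
    out2.close()

split_dump("enwiki-pages-articles.xml", "enwiki-part1.xml", "enwiki-part2.xml")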

I would like to change the file name created when I click "Download as PDF"

Right now, I always get a file called collection.pdf. I want the name to be something like wiki-'title'.pdf — Preceding unsigned comment added by 96.255.124.207 ( talk) 23:53, 11 June 2012 (UTC)

Wikipedia uses the "Collection" extension for MediaWiki to generate the PDF. See http://www.mediawiki.org/wiki/Extension:Collection and http://www.mediawiki.org/wiki/Extension:Collection/Wishlist I'm NOT an expert on the subject, but just know this is what Wikipedia uses. See http://en.wikipedia.org/wiki/Special:Version for proof. • SbmeirowTalk • 04:39, 12 June 2012 (UTC)
How can we change the title of the pdf to correspond to the article? Is there a more relevant project page I should ask this on? — Preceding unsigned comment added by 96.255.124.207 ( talk) 01:05, 13 June 2012 (UTC)
Possibly at http://www.mediawiki.org/wiki/Extension_talk:CollectionSbmeirowTalk • 04:06, 13 June 2012 (UTC)

Accessing an editable wiki offline?

The database download methods described here allow us to read a copy of the wiki offline. Is there a way to access an offline copy of the wiki that not only can be read as if it were the real thing, but also edited as if it were the real thing? I'm not asking for a distributed revision control system, in which the wiki would automatically update as individual users make their own edits. Rather, I'm wondering if it's possible to have an editable offline copy of the wiki, where the real wiki would only be modified by manually copying changes from the offline version to the online one, article by article. 201.52.88.3 ( talk) 02:50, 16 June 2012 (UTC)

Dump strategy and file size

Hi. I notice that the monthly dump of enwiki "Articles, templates, media/file descriptions, and primary meta-pages" (e.g. as found at dumps.wikimedia.org/enwiki/20120403) is available as 27 XML files (e.g. starting with "enwiki-20120403-pages-articles1.xml-p000000010p000010000.bz2" and ending with "enwiki-20120403-pages-articles27.xml-p029625017p035314669.bz2").

  • What is the strategy used to determine which article ends up in which file?
  • Why is file 27 considerably larger than the other files? Will 27 continue to grow, or will it spill over into a 28th file?

Thanks in advance for your assistance. GFHandel    21:41, 24 April 2012 (UTC)

Hi again. Just wondering if anyone here is able to answer my questions? Thanks in advance if anyone can. GFHandel    21:22, 27 July 2012 (UTC)
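
On the first question: the part filenames themselves give away at least the mechanism, since the pSTARTpEND suffix is the range of page IDs each file covers. A small sketch that uses only that observation to locate the right part for a given page ID (the directory and ID are placeholders):

import os
import re

def file_for_page_id(dump_dir, page_id):
    # Filenames look like ...pages-articlesN.xml-pSTARTpEND.bz2
    pat = re.compile(r"pages-articles\d+\.xml-p(\d+)p(\d+)\.bz2$")
    for name in sorted(os.listdir(dump_dir)):
        m = pat.search(name)
        if m and int(m.group(1)) <= page_id <= int(m.group(2)):
            return name
    return None

print(file_for_page_id(".", 12345))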

E-book reader format?

I'd like to have a dump of wikipedia that is:

  1. readable offline
  2. with images (optional)
  3. readable on a e-book reader (like the Amazon Kindle or the Nook)

Yes, I know about the WikiReader, but I don't want to be stuck reading *only* Wikipedia on the device. :)

Is that a realistic expectation, or should I just get a normal ebook reader with wifi capabilities and read Wikipedia from there? And if so, which one has satisfactory rendering? -- TheAnarcat ( talk) 14:32, 26 July 2012 (UTC)

Just the science articles please...

Is there any way to break out and download just the science articles? Just math, science, history, and biographies of those involved? I know the selection wouldn't be perfect, but I want a lighter-weight dump without all the pop culture, sports, music, etc. Has anyone already done this? 24.23.123.92 ( talk) 21:14, 27 July 2012 (UTC)

Article namespace only

I'd like to get just the article namespace, as I have download usage limits. Regards, Sun Creator( talk) 00:03, 27 September 2012 (UTC)

all Wikipedia on torrent

Is it possible to make a torrent with all of Wikipedia (images, articles)? Maybe in torrent parts.

How many TB? -- CortexA9 ( talk) 15:56, 1 November 2012 (UTC)

Wikipedia versions missing?

This is a very useful resource, but are there some versions missing? I am especially looking for the http://fiu-vro.wikipedia.org (Võru) one (boasting 5090 articles), but with this one missing there may be others as well. Any clue, anyone? Trondtr ( talk) 16:28, 28 January 2013 (UTC).

Image dumps are defective or inaccessible, images torrent tracker doesn't work either

All links to image dumps in this article are bad, and all the BitTorrents are not working. As of now there is no way to download the images other than scraping the article pages. This will choke wiki bandwidth, but it looks like people have no other choice. I think sysops should take a look at this, unless of course this is intentional... is it? It's still not working; please have a look. — Preceding unsigned comment added by 116.73.4.171 ( talk) 13:49, 9 March 2013 (UTC)

Supporting multiple wikipedia languages concurrently

I often use Wikipedia as a dictionary, and if I work online I can find an article in the English Wikipedia and then use the link on the left side to jump to another language. How can I do this offline? For example, if I use WikiTaxi I can only load one database, and it's not possible to jump to another language version of the current article.

Information about which articles discuss more or less the same topic but in a different language ("interlanguage links") has recently been moved to Wikidata. As with other WikiMedia projects, database dumps are available; see Wikidata:Database_download. - TB ( talk) 08:43, 5 June 2013 (UTC)

Total size over time?

I think it would be very interesting to explain what all the files add up to presently, and how it has changed over time. Specifically, I am thinking that Moore's law type effects on computer memory probably exceed the growth of the database, so that at some predictable future time it will be easy for most people to have the entire database in a personal device. Could you start a section providing these statistics for us, just as a matter of general interest? Thanks. Wnt ( talk) 17:03, 18 March 2013 (UTC)

There's lots of statistical information about Wikipedia available at Wikipedia:Statistics. Most relevant to your enquiry is probably Wikipedia:Size of Wikipedia. In a nutshell, it's entirely possible to build yourself a (read-only) Hitchhiker's Guide to the Galaxy today - it's just a bit expensive and slow. - TB ( talk) 08:53, 5 June 2013 (UTC)