This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page. |
Archive 1 | Archive 2 | Archive 3 |
http://static.wikipedia.org/
Hello, the Static HTML dumps download page never works. Is there a non-Wikipedian mirror where we can download the complete HTML version of en.wiki (and fr:wiki as well)? Or maybe a torrent? 13:16, 5 September 2007 (UTC)
The enwiki database dumps seem to either get cancelled or fail. Considering the importance of the English Wikipedia, this seems like a critical problem; is anyone working on fixing it? It looks like there hasn't been a proper backup of enwiki for over two months. This needs escalating, but there are no obvious ways of doing so. -- Alun Liggins 19:51, 5 October 2007 (UTC)
Hello, could someone who knows about this please post the link to the exact file, pages-articles.xml.bz2, for me? The more recent the better. I did read the explanations, but when I go to http://download.wikimedia.org/enwiki/ I can't download anything useful. When I get to http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml I then click http://download.wikimedia.org/enwiki/20070908/enwiki-20070908-pages-articles.xml.bz2 but I get error 404. Are there any older versions I can download? Thanks in advance. Jackaranga 01:00, 20 October 2007 (UTC)
Hi, I'm editing the article MS. Actually, it's more of a disambiguation page or a list. There is a little debate there about how we could format the list. I couldn't help but think of my good old database programming class. There we could pick and choose how we wanted the information to be displayed. I think it would be really handy to be able to make a table, sort it in various ways, and have it display on Wikipedia the way the final user would like. For example, the article could sort the list by MS, mS, Ms, M.S., etc., or by category: medical, aviation, etc., then alphabetically, and so on. I can't quite put two and two together on how SQL and regular article (Wikipedia's database) technologies could be made to work together. -- CyclePat ( talk) 03:28, 26 November 2007 (UTC)
Are there any dumps of only the main namespace? It would be simple to do on my own, but it would be time consuming and memory intensive, and it seems like this would be something useful for other users. My parser is getting bogged down on old archived Wikipedia namespace pages which aren't particularly useful for my purposes, so it would be nice to have only the actual articles. Thanks. Pkalmar ( talk) 01:58, 21 December 2007 (UTC)
I can't seem to be able to read the file on my computer. Any help? And where is the current dump? I accidentally downloaded an old one. ( TheFauxScholar ( talk) 03:27, 4 April 2008 (UTC))
You would think the Wikimedia Foundation, with all the funding it gets, would be able to actually deliver the part of the open-source license that dictates that the "source" (i.e. dumps of the database) actually be made available. Currently they constantly violate this, then shout and scream (as is the Wikipedia way) at people who ask why there are no recent dumps. Hopefully someone will make a fork and run it properly, oh, but hang on, the Wikimedia Foundation "seem" to almost deliberately restrict access to the pictures.... so no fork of Wikipedia then....! —Preceding unsigned comment added by 77.96.111.181 ( talk) 18:33, 18 May 2008 (UTC)
Does anyone know why no static HTML dump is available? I also asked this question here. Bovlb ( talk) 23:11, 22 May 2008 (UTC)
Where is the last successful full-history dump?
http://www.mail-archive.com/dbpedia-discussion@lists.sourceforge.net/msg00135.html
http://download.wikimedia.org/enwiki/20080103/ —Preceding unsigned comment added by 93.80.82.194 ( talk) 19:07, 15 June 2008 (UTC)
am i insane, or does this article completely ignore user options with the mediawiki API? http://www.mediawiki.org/wiki/API Aaronbrick ( talk) 00:22, 8 October 2008 (UTC)
Hi there, if I download the static HTML dumps, how can I later organize them? I mean, how can I browse these articles in some GUI, or do I just need to browse them in Windows directories? —Preceding unsigned comment added by 62.80.224.244 ( talk) 09:54, 20 October 2008 (UTC)
Hi all. I was wanting to access a subset of pages for the purposes of doing wordcount analysis on them. Maybe a couple hundred all told. I have tried to access these pages in XML via Special:Export (i.e. http://en.wikipedia.org/wiki/Special:Export/PAGE_NAME) but received a 403 back. I looked at Wikipedia's robots.txt but cannot find any rule for Special:Export itself. Anyone know why I might be failing here? AFAIK my IP is not banned or anything like that - it's a normal server-to-server request. —Preceding unsigned comment added by Nrubdarb ( talk • contribs) 11:08, 9 November 2008 (UTC)
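One common cause of a 403 from a server-to-server request (rather than an IP ban) is a missing or generic User-Agent header. A sketch of building such a request in Python; the bot name and contact address are placeholders, not anything Wikimedia prescribes:

```python
import urllib.parse
import urllib.request

def build_export_request(title):
    # Percent-encode the title and attach a descriptive User-Agent;
    # requests without one are often answered with HTTP 403.
    url = ("https://en.wikipedia.org/wiki/Special:Export/"
           + urllib.parse.quote(title))
    return urllib.request.Request(
        url,
        headers={"User-Agent": "WordcountResearch/0.1 (contact: you@example.org)"})

# urllib.request.urlopen(build_export_request("American Samoa")) would then
# return the page's XML export.
```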
24 January 2009: the newest available dump (with only the articles, the archive most mirror sites will probably want) for the English Wikipedia enwiki is the version enwiki-20081008-pages-articles.xml.bz2 from October 2008. From the outside it looks as if an in-progress dump of pages-meta-history.xml.bz2 is blocking further exports. Is there a way to report this problem? —Preceding unsigned comment added by 92.75.194.40 ( talk) 17:22, 24 January 2009 (UTC)
There is an enwiki dump currently processing at the moment, but the previous one failed a number of jobs: [1]
In particular:
2007-04-02 14:05:27 failed Articles, templates, image descriptions, and primary meta-pages. This contains current versions of article content, and is the archive most mirror sites will probably want. pages-articles.xml.bz2
I can't seem to find any older dumps either...
67.183.26.86 17:56, 5 April 2007 (UTC)
I'm also trying to make a local mirror for use while traveling, but can't find a functioning older dump. Any word on when the next dump will be out?
I'm a little worried: I am a PhD student whose research is completely dependent on this data. With the last 2 dumps failing, and others being removed, there isn't a single complete dump of the English Wikipedia data left available for download - and there hasn't been a new dump for almost a month. Has something happened? -- 130.217.240.32 04:47, 27 April 2007 (UTC)
A full, 84.6GB (compressed) dump has finally completed [2] (PLEASE, whoever runs the dumps, keep this one around at least until a newer version is complete, such that there's always at least one good, full dump). Regarding research, it's been suggested to dump random subsets in the requests for dumps on the meta site, but nothing's happened regarding it as of yet AFAICT. -- 81.86.106.14 22:38, 8 May 2007 (UTC)
Over at Wikipedia:Improving referencing efforts we're trying to improve article referencing. It would be helpful if we had statistics on the proportion of articles with references, average references per article, etc. Does anyone have a bot that could help with this? Thanks. - Peregrine Fisher ( talk) ( contribs) 21:49, 15 December 2008 (UTC)
Can we get access to the wikipedia search queries? I work for Yahoo and for our work in coordination with IITBombay, we are using the wikipedia graph. Please let me know at manish_gupta12003 [at] yahoo.com if we can have an access to a small subset of such queries (without any session info or user info). Thanks! —Preceding unsigned comment added by 216.145.54.7 ( talk) 07:36, 28 April 2009 (UTC)
I'm finding articles missing from the dumps.
Let's take a look at the first article in the dump:
AmericanSamoa (id 6)
#REDIRECT [[American Samoa]]{{R from CamelCase}}
Looking for redirect (American Samoa) in the dump:
grep -n ">American Samoa" enwiki-20090610-pages-articles.xml
72616: <title>American Samoa/Military</title>
6006367: <title>American Samoa/Economy</title>
7298422: <title>American Samoa/People</title>
7849801: <title>American Samoa/Geography</title>
7849816: <title>American Samoa/Government</title>
7849831: <title>American Samoa/Communications</title>
7849846: <title>American Samoa/Transportation</title>
44729785: <title>American Samoa at the 2004 Summer Olympics</title>
46998496: <title>American Samoan</title>
47801378: <title>American Samoa national football team</title>
65053327: <title>American Samoa territory, United States</title>
69169386: <title>American Samoa at the 2000 Summer Olympics</title>
74281114: <title>American Samoa Community College</title>
75011596: <text xml:space="preserve">American Samoa<noinclude>
79006314: <title>American Samoa national rugby league team</title>
97301557: <title>American Samoa National Park</title>
108306703: <title>American Samoa Territory Constitution</title>
119786420: <title>American Samoa at the 1996 Summer Olympics</title>
120802909: <title>American Samoa Fono</title>
120802959: <title>American Samoa House of Representatives</title>
120803207: <title>American Samoa Senate</title>
124460055: <title>American Samoa's At-large congressional district</title>
129254992: <title>American Samoan territorial legislature</title>
138027379: <title>American Samoa Football Association</title>
Can anyone shed any light on this? -- Smremde ( talk) 12:01, 25 June 2009 (UTC)
mysql> select page_id, page_title from page where page_namespace = 0 and page_title LIKE 'American_Samoa%' ORDER BY 1 ASC;

  page_id   page_title
     1116   American_Samoa/Military
    57313   American_Samoa/Economy
    74035   American_Samoa/People
    82564   American_Samoa/Geography
    82565   American_Samoa/Government
    82566   American_Samoa/Communications
    82567   American_Samoa/Transportation
   924548   American_Samoa_at_the_2004_Summer_Olympics
   997978   American_Samoan
  1024704   American_Samoa_national_football_team
  1687185   American_Samoa_territory,_United_States
  1843552   American_Samoa_at_the_2000_Summer_Olympics
  2058267   American_Samoa_Community_College
  2262625   American_Samoa_national_rugby_league_team
  3168052   American_Samoa_National_Park
  3754535   American_Samoa_Territory_Constitution
  4442190   American_Samoa_at_the_1996_Summer_Olympics
  4504211   American_Samoa_Fono
  4504216   American_Samoa_House_of_Representatives
  4504225   American_Samoa_Senate
  4738483   American_Samoa's_At-large_congressional_district
  5072984   American_Samoan_territorial_legislature
  5656610   American_Samoa_Football_Association
  7499304   American_Samoa_at_the_1988_Summer_Olympics
  7821424   American_Samoa_women's_national_football_team
  7855525   American_Samoa_at_the_1994_Winter_Olympics
  7855530   American_Samoa_at_the_1992_Summer_Olympics
  7873035   American_Samoa_Territorial_Police
  8213414   American_Samoa_Department_of_Education
  8233086   American_Samoan_general_election,_2008
  9682159   American_Samoa_Constitution
 10158999   American_Samoa_national_rugby_union_team
 11944957   American_Samoan_general_election,_2004
 11944974   American_Samoan_legislative_election,_2006
 11944988   American_Samoan_general_election,_2006
 12095118   American_Samoa_national_soccer_team
 12225196   American_Samoa_national_football_team_results
 12869869   American_Samoa_at_the_2007_World_Championships_in_Athletics
 12999445   American_Samoa_national_basketball_team
 14938467   American_Samoa_at_the_Olympics
 15587944   American_Samoa's_results_and_fixtures
 15611124   American_Samoa_Democratic_caucuses,_2008
 15641610   American_Samoa_Republican_caucuses,_2008
 15681703   American_Samoa_Republican_caucuses
 15929695   American_Samoa_Territory_Supreme_Court
 16089779   American_Samoa_at_the_2008_Summer_Olympics
 19468499   American_Samoa_Governor
 19785703   American_Samoa_Supreme_Court
 19850654   American_Samoa_gubernatorial_election,_2008
 19913142   American_Samoa_Power_Authority
 20611195   American_Samoa

51 rows in set (0.00 sec)
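One way to double-check results like the above without loading the whole file into memory is to stream the compressed dump and test for an exact title match. A sketch, assuming the standard pages-articles XML layout:

```python
import bz2
import xml.etree.ElementTree as ET

def title_in_dump(path, wanted):
    # Stream-parse the .bz2 dump; clear elements as we go so memory
    # stays bounded even on a multi-gigabyte file.
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f):
            if elem.tag.endswith("title") and elem.text == wanted:
                return True
            elem.clear()
    return False
```

Unlike a grep for `">American Samoa"`, this only matches the exact title, not every title that starts with the string.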
I was trying to download the articles' abstracts of the English Wikipedia and just realized that the files are only a few KB big and only contain a couple of abstracts. I checked older versions of the dump, and dumps in other languages, and the situation is the same. Does anyone know how to fix the dump of the articles' abstracts, or know any other way to get those abstracts? —Preceding unsigned comment added by 141.45.202.255 ( talk) 10:07, 15 September 2009 (UTC)
Possibly a dumb question, but which enwiki dumps contain redirects (i.e. an entry including title & text)? It would seem my "enwiki-20090902-pages-articles" doesn't have them? Thanks Rjwilmsi 23:13, 20 January 2010 (UTC)
- I've installed the MediaWiki script
- downloaded and imported all tables except the OLD tables...
MediaWiki doesn't work! Do I absolutely need the old tables or not?
Thanks
Davide
—Preceding unsigned comment added by 80.116.41.176 ( talk) 19:27, 1 July 2005 (UTC)
Recent mailing list posts [3] [4] [5] indicate developers have been busy with the Mediawiki 1.5 update recently, but that smaller, more specific dumps can now more easily be created. I'm hoping we will see an article-namespace-only dump. The problem with talk pages is that the storage space required for them will grow without bound, whereas we can actually try to reduce the storage space needed for articles - or at least slow growth - by merging redundant articles, solving content problems, etc. Such a dump would also make analysis tools that look only at encyclopedic content (and there are a growing number of useful reports - see Wikipedia:Offline reports and Template:Active Wiki Fixup Projects) run faster and not take up ridiculously large amounts of hard drive space (which makes it more difficult to find computers capable of producing them).
User:Radiant! has been asking me about getting more frequent updates of category-related reports for a "Categorization Recent Changes Patrol" project. This is difficult to do without more frequent database dumps, and I'm sure there are a number of other reports that could benefit from being produced more frequently, or at least containing more up to date information. (And less human editor time would be wasted as a result, we hope.)
I'm actually surprised that the database dumps must be produced manually; I'm sure there's an engineering solution that could automate the process, reducing the burden on developers. I hope the developers will be able to deal with performance and other issues to be able to do this in the near future, so we can stop nagging them about database dump updates and get on with the work of fixing up the wiki. -- Beland 03:48, 8 July 2005 (UTC)
Check WikiFilter.
It works with all current wiki project dump files in all languages. You do not need PHP or Mysql. All you need is a web server like Apache, and then you can view a wiki page through your web browser, either in normal html format, or in the raw wikitext format.
Rebuilding an index data-base is also reasonably fast. For example, the 3-GB English Wikipedia takes about 10 minutes on a Pentium4. Wanchun ( Talk) 06:35, 20 September 2005 (UTC)
Is there a good, approved way to download a relatively small number of images (say ~4000)? The tar download is kinda slow for this purpose. It'd be faster to web crawl (even with the 1-second delay) than to deal with the tar. But that seems disfavoured.
Hi, I recently created a wiki for sharing code snippets and I've started to get some spam bots adding nonsense or spam to the articles. Rather than block them one at a time, I was thinking it'd be easier to preemptively use Wikipedia's IP block list and upload it into my wiki.
So, my question is, how do I download the block list -- and -- how do I upload it to my wiki? Thanks! - 206.160.140.2 13:49, 29 June 2006 (UTC)
If you are looking for an offline copy of wikipedia, I refer you to kiwix and a howto. Wizzy… ☎ 08:55, 27 April 2010 (UTC)
Why the enwiki dump completed in April/2010:
20100312 # 2010-04-16 08:46:23 done All pages with complete edit history (.7z) Everything is Ok * pages-meta-history.xml.7z 15.8 GB
is two times smaller than the previous dump completed in March/2010?
20100130 # 2010-03-26 00:57:06 done All pages with complete edit history (.7z) Everything is Ok * pages-meta-history.xml.7z 31.9 GB
-- Dc987 ( talk) 23:09, 29 April 2010 (UTC)
Is there somewhere where I can download some of the templates that are in use here on wikipedia?
—Preceding unsigned comment added by 74.135.40.211 ( talk)
here
Wowlookitsjoe 18:40, 25 August 2007 (UTC)
I am looking to download Wiki templates as well. What else will I need to work with it? FrontPage? Dreamweaver? —Preceding unsigned comment added by 84.59.134.169 ( talk) 19:31, August 27, 2007 (UTC)
I am looking for a way to download just the templates too. I am trying to import articles from Wikipedia that use templates, and nested templates prevent the articles from displaying correctly. Is there any way to exclusively get just the templates? —Preceding unsigned comment added by Coderoyal ( talk • contribs) 03:20, 15 June 2008 (UTC)
Yes agreed, even just a way to download the basic templates that you need to make an infobox and a few other things to get you started would be great.-- Kukamunga13 ( talk) 22:44, 8 January 2009 (UTC)
Yeah, I was searching for exactly what is described here. Could anyone help? Cheers
Found this discussion as well when trying to export templates - further googling revealed there is a method using Special:Export and specifying the articles you want to export; these can then be imported with Special:Import or the included XML import utilities. -- James.yale ( talk) 08:53, 4 September 2009 (UTC)
I, too, would LOVE a way to download templates without having to wade through some truly disparate garbage that some people (thankfully not all) call "documentation." This is the biggest flaw in Wikipedia - crappy documentation that doesn't tell newbies how to find much of ANYthing. Liquidfractal ( talk) 01:26, 18 August 2010 (UTC)
Hi all. I'm a bit worried about all this raw data we are writing every day, aka Wikipedia and wikis. Due to the closing of GeoCities by Yahoo!, I learnt about Archive Team (it's a must see). Also, I knew about Internet Archive. Webpages are weak, hard disks can be destroyed in accidents, etc.
Wikimedia Foundation offers us dumps for all the projects, and I'm not sure how many people download them all. Some days ago I wrote a small script (I can share it, GPL) which downloads all the backups (only the 7z files, about ~100GB; I think they are the most important). Are you interested in getting a copy of all the Wiki[mp]edia public data? I don't know about public mirrors of Wikipedia (although there are many for Linux ISOs). There are no public dumps of Nupedia or their mailing lists, only available through Internet Archive. [6] [7]. Today, that is history.
I wrote a blog post about this. I think that this is very important, we are writing all the human knowledge, and it needs to be saved in every continent, every country around the globe. Please, write your comments and suggestions. Regards. emijrp ( talk) 12:05, 16 August 2010 (UTC)
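For anyone wanting to do the same, the core of such a mirroring script is small. A sketch of the URL-building part (the /latest/ path and file name follow the standard dump directory layout, but verify against the live listing; a real mirror should also fetch and check the published md5sums):

```python
BASE = "http://download.wikimedia.org"

def latest_7z_url(project):
    # Full-history 7z archive under the standard /<project>/latest/ layout.
    return "{0}/{1}/latest/{1}-latest-pages-meta-history.xml.7z".format(BASE, project)

# Build the download list for a handful of projects; feeding these to
# urllib.request.urlretrieve (or wget) is the rest of the job.
urls = [latest_7z_url(p) for p in ("enwiki", "frwiki", "dewiki")]
```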
Would adding http://freecache.org to the database dump files save bandwidth? Perhaps this could be experimented on for files < 1Gb & > 5Mb - Alterego
Huge article, lots of information, but not a single hint as to the approximate size of the Wikipedia database... Can anyone help me find that out? -- Procrastinating@ talk2me 13:13, 25 December 2006 (UTC)
Some info about the growth of uploaded images every month. Sizes are in MB. The largest batch of images uploaded was in 2010-07: 207,946 images, +345 GB, 1.6 MB per image average. Regards. emijrp ( talk) 11:36, 16 August 2010 (UTC)
date     imagescount  totalsize       avgsizeperimage
2003-1   1            1.29717636      1.297176361084
2004-9   685          180.10230923    0.262923079163
2004-10  4171         607.82356930    0.145726101486
2004-11  3853         692.86077213    0.179823714543
2004-12  6413         1433.85031033   0.223584954050
2005-1   7754         1862.35064888   0.240179345999
2005-2   8500         2974.20539951   0.349906517590
2005-3   12635        3643.41866493   0.288359213687
2005-4   15358        5223.11017132   0.340090517731
2005-6   18633        6828.90809345   0.366495362714
2005-7   19968        8779.54927349   0.439680953200
2005-11  21977        9490.01213932   0.431815631766
2005-5   27811        10488.68217564  0.377141497092
2005-12  28540        10900.89676094  0.381951533320
2005-8   29787        12007.92203808  0.403126264413
2005-9   28484        13409.05960655  0.470757604499
2006-2   36208        14259.63428211  0.393825515966
2006-1   35650        14718.57859612  0.412863354730
2005-10  31436        16389.36539268  0.521356578212
2006-3   41763        18516.05062675  0.443360166338
2006-4   46645        21974.19114399  0.471094246843
2006-5   49399        28121.86408234  0.569280027578
2006-6   50893        28473.23626709  0.559472545676
2006-12  52577        29190.01748085  0.555186060080
2006-7   58384        30763.06628609  0.526909192349
2006-10  59611        32019.11058044  0.537134263482
2006-11  57070        32646.13846588  0.572036770035
2006-9   64991        35881.70751190  0.552102714405
2007-2   64787        37625.21335030  0.580752517485
2007-1   71989        38771.36503792  0.538573463139
2006-8   78263        48580.74944115  0.620737122793
2007-3   97556        51218.63132572  0.525017746994
2007-6   92656        60563.04164696  0.653633241743
2007-4   116127       69539.80562592  0.598825472336
2007-5   94416        69565.31412792  0.736795819860
2007-12  96256        72016.71782589  0.748179000020
2007-7   110470       72699.43968487  0.658092148863
2008-2   103295       73830.05222321  0.714749525371
2007-11  118250       78178.20839214  0.661126498031
2008-1   114507       80664.45367908  0.704449978421
2008-3   115732       84991.75799370  0.734384249764
2007-10  112011       85709.30096245  0.765186463494
2007-8   120324       87061.07968998  0.723555397842
2008-4   102342       93631.80365562  0.914891282715
2007-9   125487       95482.06631756  0.760892094939
2008-6   101048       95703.97241211  0.947113969718
2008-11  120943       110902.88121986 0.916984705356
2008-5   116491       112381.90718269 0.964726091996
2008-9   134362       114676.82158184 0.853491475133
2008-10  114676       116692.37883282 1.017583267927
2009-2   127162       119990.37194729 0.943602427984
2008-12  193181       120587.25548649 0.624219025093
2009-1   126947       121332.83420753 0.955775514250
2008-7   119541       122324.61021996 1.023285820095
2008-8   119471       124409.18394279 1.041333745786
2010-8   89776        164473.27284527 1.832040554773
2009-9   132803       176394.38845158 1.328240991932
2010-2   158509       182738.31155205 1.152857639327
2009-5   143389       182773.11001110 1.274666187860
2009-6   160070       186919.01145649 1.167732938442
2009-11  178118       188819.04353809 1.060078394874
2009-4   196699       202346.40691471 1.028710908112
2010-3   178439       206399.28073311 1.156693776210
2010-1   361253       216406.85650921 0.599045147055
2009-12  143797       217440.78340054 1.512137133602
2009-7   164040       230185.27826881 1.403226519561
2009-8   167079       250747.52404118 1.500772233741
2009-3   209453       260923.17462921 1.245736153835
2010-6   173658       270246.74141026 1.556200931775
2010-4   208518       297715.36278248 1.427768167652
2010-5   203093       297775.34260082 1.466201900611
2009-10  272818       329581.74736500 1.208064524207
2010-7   207946       345394.14518356 1.660979990880
Hi everyone! :-)
After reading the Download Wikipedia pages earlier today, a few ideas have popped into my mind for a compacted, stripped-down version of Wikipedia that would (hopefully) end up small enough to fit on standard DVD-5 media, and might be a useful resource for communities and regions where Internet access is either unavailable or prohibitively expensive. The idea I currently have in mind is primarily based on storing everything in 6-bit encoding (then BZipped) which, though leading to a loss of anything other than very basic formatting and line/paragraph breaks, would still retain the text of the articles themselves, which is the most important part of Wikipedia's content! :-)
Anyhow, I'm prone to getting ideas in mind and then never acting upon them, or getting sidetracked by other things... So unless I ever get all of the necessary bits 'n' pieces coded and ready for testing, I'm not even going to consider drinking 7.00GB of Wikipedia's bandwidth on a mere personal whim. That said, before I give more thought to the ideas in mind, I just wanted to check a few things with more knowledgeable users:
Farewell for now, and many thanks in advance for any advice! >:-)
+++ DieselDragon +++ ( Talk) - 22 October 2010 CE = 23:44, 22 October 2010 (UTC)
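The 6-bit idea above is easy to prototype. A sketch with an illustrative 51-symbol alphabet (any alphabet of at most 64 symbols works; four characters then pack into three bytes, a 25% saving before bzip2 even runs):

```python
# Illustrative alphabet; a real build would pick symbols by frequency.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,;:!?()[]'\"-\n"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def pack(text):
    # Accumulate 6 bits per character, emitting full bytes as they fill up.
    bits, nbits, out = 0, 0, bytearray()
    for ch in text:
        bits = (bits << 6) | IDX[ch]
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:  # pad the trailing partial byte with zero bits
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out), len(text)

def unpack(data, n):
    # Reverse the packing: read bytes, emit n six-bit symbols.
    bits, nbits, out = 0, 0, []
    for b in data:
        bits = (bits << 8) | b
        nbits += 8
        while nbits >= 6 and len(out) < n:
            nbits -= 6
            out.append(ALPHABET[(bits >> nbits) & 0x3F])
    return "".join(out)
```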
Regards,
Rich
Farmbrough, 10:47, 17 November 2010 (UTC).
I note that it isn't currently possible to download a data dump due to server maintenance issues, however I did find a dump of a file called enwiki-20091017-pages-meta-current.xml.bz2 on BitTorrent via The Pirate Bay. Is this the same data as pages-articles.xml.bz2?
One further query: has the Wikimedia Foundation considered making its data dumps available via BitTorrent? If you did, it might remove some load from your servers. -- Cabalamat ( talk) 01:40, 14 December 2010 (UTC)
Hi all, a couple of days ago I wrote up my problem on LyricsWiki's Help Desk page, but they didn't answer anything! I thought there was a relationship between Wikia projects & Wikipedia, so I decided to ask here. This is the problem I wrote there:
"Hi, I'm lookin' for this wikia's dump; like any other wikia I went to the "Statistics" page, and the dump file (which was at the bottom of the page on other wikias) was not there!! Why aren't there any dumps? Isn't it free in all wiki projects?"
So, at [8] there should be a dump of LyricsWiki, but there isn't! Why? Isn't it a free download? -- ♥MehranVB♥ ☻talk | ☺mail 16:14, 21 December 2010 (UTC)
Splitting into A-Z volumes has been done for several hundred years to manage the huge size of encyclopedias. It would be more manageable for users to download, and maybe it would help in creating the dump images too. Paragraph 2 states "The dump process has had difficulty scaling to cope with the size of the very largest wikis". Can you provide more info on what kind of difficulties were encountered? Why can't "traditional" splitting into A-Z help the dump process? 00:56, 3 February 2011 (UTC) — Preceding unsigned comment added by Luky83 ( talk • contribs)
This article explains that the full dumps of the English-language Wikipedia have been failing, but doesn't indicate whether or not the problem is going to be resolved. Can we expect to see a full extract any time soon, or even for this to become routine again? — Preceding unsigned comment added by Jdownie ( talk • contribs) 11:56, 22 April 2011 (UTC)
Does anyone else feel that, even though it says 'most' and it's probably true that copyright infringements are more common among images, the statement is potentially confusing, as it seems to suggest copyright infringements are something unique to images? I was thinking of something like 'or believed'. And perhaps at the end: 'Remember some content may be copyright infringements (which should be deleted).' Nil Einne ( talk) 14:39, 16 January 2011 (UTC) Is there any way to easily download all of these images? I could write a quick script to look at the tags on each File: page and filter out the non-free images. -- RaptorHunter ( talk) 22:53, 24 April 2011 (UTC)
I have created a new torrent for the May 2011 database dump. This torrent resolves some of the problems with the old torrent. It's also a gigabyte smaller (I cut out all of the Wikipedia: namespace pages) -- RaptorHunter ( talk) 01:57, 29 May 2011 (UTC)
http://thepiratebay.org/torrent/6430796
I noticed everyone whining about downloads taking up precious Wikipedia bandwidth, which makes me wonder why the most popular and largest downloads are NOT available as a torrent. Project Gutenberg has the option to download ebook CDs and DVDs via torrent. • Sbmeirow • Talk • 21:39, 21 December 2010 (UTC)
Is there a dump of Wikipedia articles without the brackets? Clean text without the bracketed links and references like [[ ]] and <references>? Clean and plain text. I'm just not sure if I can find this information in the article? Thanks. 71.33.206.156 ( talk) 10:24, 5 March 2011 (UTC)
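As far as I know there is no official plain-text dump, so people usually strip the markup themselves. A rough regex sketch (imperfect on nested templates and tables, which really need a proper wikitext parser):

```python
import re

def strip_wikitext(text):
    # Drop self-closing refs like <ref name="x"/>.
    text = re.sub(r"<ref[^>]*/>", "", text)
    # Drop full <ref>...</ref> citations.
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)
    # Replace [[target|label]] / [[target]] with the visible text.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs.
    text = re.sub(r"'{2,}", "", text)
    return text
```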
It appears that the http://download.wikimedia.org/enwiki/20110620/enwiki-20110620-pages-articles.xml.bz2 file is 5.8GB, whereas http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-articles.xml.bz2 is 6.8GB. The big difference between these two files suggests a problem in the dump process. Any idea about what might have happened here? Thanks! — Preceding unsigned comment added by 65.119.214.2 ( talk) 20:28, 7 July 2011 (UTC)
Is there any chance of getting just the history pages and logs for the wiki? I am interested in researching editing patterns and behaviors and I don't need any article text (especially not every version of article text). Or is there a better place to get this information? — Bility ( talk) 20:06, 15 August 2011 (UTC)
I have created a Wikipedia Torrent so that other people can download the dump without wasting Wikipedia's bandwidth.
I will seed this for a while, but I really am hoping that Wikipedia could publish an official torrent itself every year containing validated articles (checked for vandalism and quality), and then seed that torrent indefinitely. This would reduce strain on the servers and make the process of obtaining a database dump much simpler and easier. It would also serve as a snapshot in time, so that users could browse the 2010 or 2011 Wikipedia.
Torrent Link — Preceding unsigned comment added by RaptorHunter ( talk • contribs) 02:58, 11 March 2011 (UTC)
I wrote a script to cut out all of the Wikipedia: namespace from the database dump. This includes stuff like WP:AFD, WP:Village Pump, and hundreds of pages of archives that most people don't want. Removing it saves about 1.2GiB, or 19%. I also cut out the File:, MediaWiki:, and Help: namespaces.
Tell me if you think those namespaces should be cut from my next torrent.-- RaptorHunter ( talk) 22:47, 24 April 2011 (UTC)
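For anyone wanting to reproduce this kind of pass, it can be sketched as a line filter. This is not RaptorHunter's actual script; it assumes, as in the real dumps, that each tag sits on its own line (a robust version would use a streaming XML parser instead):

```python
# Namespace prefixes to drop; an assumption matching the list above.
SKIP = ("Wikipedia:", "File:", "MediaWiki:", "Help:")

def filter_pages(lines):
    # Buffer each <page>...</page> block; emit it only if its <title>
    # does not start with one of the unwanted namespace prefixes.
    page, keep, out = [], True, []
    for line in lines:
        if "<page>" in line:
            page, keep = [line], True
            continue
        if page:
            page.append(line)
            if "<title>" in line:
                title = line.split("<title>")[1].split("</title>")[0]
                keep = not title.startswith(SKIP)
            if "</page>" in line:
                if keep:
                    out.extend(page)
                page = []
        else:
            out.append(line)  # header/footer outside any <page> block
    return out
```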
I have the latest dump of Wikipedia articles as an .xml file, in the hope of sorting it, using only articles on a few topics, then uploading them to a MySQL database as a supplement to an online dictionary.
Which text editor do people use for the 33 GB XML file (UltraEdit / VIM)? I only have 8 GB of RAM on this computer - not sure if it will just load whatever part I'm currently viewing, or use a scratch disk, or just crash.
Also, the filename is enwiki-latest-pages-articles.xml.bz2.xml.bz2.xml - I used 7-Zip to expand the first level, but what would I use to uncompress the next? Isn't XML already uncompressed?
Thanks to anyone who is familiar with this. -- Adam00 ( talk) 16:37, 13 October 2011 (UTC)
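Rather than opening a 33 GB file in an editor at all (most editors will try to load it entirely and crash), it is usually enough to stream just the part you need. A small sketch that reads only the first lines of a file of any size:

```python
import itertools

def head(path, n_lines=40):
    # Read only the first n_lines; the rest of the file is never loaded.
    with open(path, encoding="utf-8", errors="replace") as f:
        return list(itertools.islice(f, n_lines))
```

The same streaming approach (iterating line by line, or with a SAX/iterparse XML parser) also works for the sorting and MySQL-loading steps.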
The multistream dump used a signed 32-bit variable as the offset, and as such, offsets of articles past 2GB overflowed and are negative numbers. Did anybody notice this? Jimreynold2nd ( talk) 05:32, 29 February 2012 (UTC)
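If that is what happened, index entries written through a signed 32-bit variable can be recovered, for offsets below 4 GiB, by reinterpreting the negative value as unsigned. A sketch:

```python
def fix_offset(offset):
    # A negative value is the unsigned offset wrapped through signed
    # 32-bit arithmetic; masking recovers it (only valid below 4 GiB,
    # beyond which the information is lost entirely).
    return offset & 0xFFFFFFFF if offset < 0 else offset
```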
Hello, I use BzReader to read Wikipedia offline. The problem is that the size of the English dumps is very big (7.5 GB) and BzReader doesn't succeed in creating the index. On the site we can download the dumps in parts, but as there are many parts (28), the search becomes very slow. I'm looking for the name of the software they use to split the dump, so I can split it into just two parts. Thank you, and sorry for my English. Rabah201130 ( talk) 10:52, 21 May 2012 (UTC)
Right now, I always get a file called collection.pdf. I want the name to be something like wiki-'title'.pdf — Preceding unsigned comment added by 96.255.124.207 ( talk) 23:53, 11 June 2012 (UTC)
The database download methods described here allow us to read a copy of the wiki offline. Is there a way to access an offline copy of the wiki that not only can be read as if it were the real thing, but also edited as if it were the real thing? I'm not asking for a distributed revision control system, in which the wiki would automatically update as individual users make their own edits. Rather, I'm wondering if it's possible to have an editable offline copy of the wiki, that someone could edit, only being able to modify the real wiki if manually copying changes from the offline version to the online one, article by article. 201.52.88.3 ( talk) 02:50, 16 June 2012 (UTC)
Hi. I notice that the monthly dump of enwiki "Articles, templates, media/file descriptions, and primary meta-pages" (e.g. as found at
dumps
Thanks in advance for your assistance. GFHandel ♬ 21:41, 24 April 2012 (UTC)
I'd like to have a dump of wikipedia that is:
Yes, I know about the WikiReader, but I don't want to be stuck reading *only* Wikipedia with the device. :)
Is that a realistic expectation, or should I just get a normal ebook reader with wifi capabilities and read Wikipedia from there? And if so, which one has satisfactory rendering? -- TheAnarcat ( talk) 14:32, 26 July 2012 (UTC)
Is there any way to break out and download just the science articles? Just math, science, history, and biographies of those involved? I know the selection wouldn't be perfect, but I want a lighter-weight dump without all the pop culture, sports, music, etc. Has anyone already done this? 24.23.123.92 ( talk) 21:14, 27 July 2012 (UTC)
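One way to do this yourself is to stream the pages-articles dump and keep only pages whose [[Category:...]] links match a topic whitelist. The sketch below assumes a bz2-compressed XML dump; the category keywords are illustrative assumptions, not a complete taxonomy, and the filenames are hypothetical.

```python
import bz2
import re

# Assumption: these keyword stems roughly capture "science-ish" categories.
TOPIC_RE = re.compile(
    r"\[\[Category:[^\]]*(mathematic|physic|chemistry|biolog|history|scientist)",
    re.IGNORECASE)

def is_wanted(page_xml):
    """Crude test: does this <page> block carry a whitelisted category link?"""
    return bool(TOPIC_RE.search(page_xml))

def filter_dump(src_path, dst_path):
    """Stream a pages-articles dump, copying only the matching <page> blocks."""
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        page, inside = [], False
        for line in src:
            if "<page>" in line:
                inside, page = True, []
            if inside:
                page.append(line)
            else:
                dst.write(line)  # siteinfo header / closing tag pass through
            if inside and "</page>" in line:
                inside = False
                if is_wanted("".join(page)):
                    dst.writelines(page)

# filter_dump("enwiki-pages-articles.xml.bz2", "science-subset.xml")
```

Category links only reflect the page's own tags, so a real subset extractor would probably want to walk the category tree; this is the minimal version.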
I'd like to just get article namespace as I have download usage limits. Regards, Sun Creator( talk) 00:03, 27 September 2012 (UTC)
It's possible to make a torrent with all (images, articles) wikipedia ? Maybe in torrent parts.
How Many TB ? -- CortexA9 ( talk) 15:56, 1 November 2012 (UTC)
This is a very useful resource, but are there some versions missing? I am especially looking for the http://fiu-vro.wikipedia.org (Võru) one (boasting 5090 articles), but with this one missing there may be others as well. Any clue, anyone? Trondtr ( talk) 16:28, 28 January 2013 (UTC).
All links to image dumps in this article are bad, and all the BitTorrents are not working. As of now there is no way to download the images other than scraping the article pages. This will choke wiki bandwidth, but it looks like people have no other choice. I think sysops should take a look at this - unless of course this is intentional... is it? It's still not working; please have a look. — Preceding unsigned comment added by 116.73.4.171 ( talk) 13:49, 9 March 2013 (UTC)
I often use Wikipedia as a dictionary: if I work online I can find an article in the English Wikipedia and then use the link on the left side to jump to another language. How can I do this offline? For example, if I use WikiTaxi I can load just one database, and it's not possible to jump to another language version of the current article.
I think it would be very interesting to explain what all the files add up to presently, and how it has changed over time. Specifically, I am thinking that Moore's law type effects on computer memory probably exceed the growth of the database, so that at some predictable future time it will be easy for most people to have the entire database in a personal device. Could you start a section providing these statistics for us, just as a matter of general interest? Thanks. Wnt ( talk) 17:03, 18 March 2013 (UTC)
Are there any dumps of only the main namespace? It would be simple to do on my own, but it would be time consuming and memory intensive, and it seems like this would be something useful for other users. My parser is getting bogged down on old archived Wikipedia namespace pages which aren't particularly useful for my purposes, so it would be nice to have only the actual articles. Thanks. Pkalmar ( talk) 01:58, 21 December 2007 (UTC)
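Short of an official main-namespace-only dump, you can filter one yourself. A minimal sketch, assuming each page is handled as a whole `<page>` XML block: newer dumps carry an explicit `<ns>` tag, and the fallback of treating a missing tag as namespace 0 is an assumption for older dumps.

```python
import re

NS_RE = re.compile(r"<ns>(\d+)</ns>")

def page_namespace(page_xml):
    """Namespace number of one <page> block. Newer dumps carry an explicit
    <ns> tag; assumption: treat pages without one as main-namespace (0)."""
    m = NS_RE.search(page_xml)
    return int(m.group(1)) if m else 0

def keep_main_only(page_blocks):
    """Filter an iterable of <page> XML strings down to articles only."""
    return [p for p in page_blocks if page_namespace(p) == 0]
```

For dumps without the `<ns>` tag you would instead match known title prefixes ("Wikipedia:", "Talk:", etc.) against the `<title>` element.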
I can't seem to be able to read the file on my computer. Any help? And where is the current dump? I accidentally downloaded an old one. ( TheFauxScholar ( talk) 03:27, 4 April 2008 (UTC))
You would think the Wikimedia Foundation, with all the funding it gets, would be able to actually deliver the part of the open-source license that dictates that the "source" (i.e. dumps of the database) actually happen. Currently they constantly violate this, then shout and scream (as is the Wikipedia way) at people who ask why there are no recent dumps. Hopefully someone will make a fork and run it properly - oh, but hang on, the Wikimedia Foundation "seem" to almost deliberately restrict access to the pictures... so no fork of Wikipedia then...! —Preceding unsigned comment added by 77.96.111.181 ( talk) 18:33, 18 May 2008 (UTC)
Does anyone know why no static HTML dump is available? I also asked this question here. Bovlb ( talk) 23:11, 22 May 2008 (UTC)
Where is the last working full-history dump?
http://www.mail-archive.com/dbpedia-discussion@lists.sourceforge.net/msg00135.html
http://download.wikimedia.org/enwiki/20080103/ —Preceding unsigned comment added by 93.80.82.194 ( talk) 19:07, 15 June 2008 (UTC)
Am I insane, or does this article completely ignore user options with the MediaWiki API? http://www.mediawiki.org/wiki/API Aaronbrick ( talk) 00:22, 8 October 2008 (UTC)
Hi there, if I download the static HTML dumps, how can I later organize them? I mean, how can I browse these articles in some GUI, or do I just need to browse them in Windows directories? —Preceding unsigned comment added by 62.80.224.244 ( talk) 09:54, 20 October 2008 (UTC)
Hi all. I was wanting to access a subset of pages for the purposes of doing wordcount analysis on them. Maybe a couple hundred all told. I have tried to access these pages in XML via Special:Export (i.e. http://en.wikipedia.org/wiki/Special:Export/PAGE_NAME) but received a 403 back. I looked at Wikipedia's robots.txt but cannot find any rule for Special:Export itself. Anyone know why I might be failing here? AFAIK my IP is not banned or anything like that - it's a normal server-to-server request. —Preceding unsigned comment added by Nrubdarb ( talk • contribs) 11:08, 9 November 2008 (UTC)
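One common cause of a 403 from Wikimedia servers is a missing or generic User-Agent header, which their etiquette guidelines ask scripts to set. A hedged sketch (the agent string and contact address are placeholders you should replace with your own):

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

def export_url(title):
    """Build the Special:Export URL for one page title."""
    return "https://en.wikipedia.org/wiki/Special:Export/" + quote(title)

def fetch_page_xml(title):
    # Assumption: the 403 comes from a blocked default User-Agent, so send a
    # descriptive one identifying the tool and a contact address.
    req = Request(export_url(title),
                  headers={"User-Agent":
                           "WordcountResearch/0.1 (contact@example.org)"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")

# fetch_page_xml("American Samoa")  # network call; uncomment to try
```

If the 403 persists with a proper User-Agent, throttling or a temporary server-side block would be the next things to rule out.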
24 January 2009: the newest available dump (with only the articles, the archive most mirror sites will probably want.) for the English Wikipedia enwiki is the version enwiki-20081008-pages-articles.xml.bz2 from October 2008. From the outside it looks as if a dump in progress of pages-meta-history.xml.bz2 blocks further exports. Is there a way to report this problem? —Preceding unsigned comment added by 92.75.194.40 ( talk) 17:22, 24 January 2009 (UTC)
There is an enwiki dump currently processing at the moment, but the previous one failed a number of jobs: [1]
In particular:
2007-04-02 14:05:27 failed Articles, templates, image descriptions, and primary meta-pages. This contains current versions of article content, and is the archive most mirror sites will probably want. pages-articles.xml.bz2
I can't seem to find any older dumps either...
67.183.26.86 17:56, 5 April 2007 (UTC)
I'm also trying to make a local mirror for use while traveling, but can't find a functioning older dump. Any word on when the next dump will be out?
I'm a little worried: I am a PhD student whose research is completely dependent on this data. With the last 2 dumps failing, and others being removed, there isn't a single complete dump of the English Wikipedia data left available for download - and there hasn't been a new dump for almost a month. Has something happened? -- 130.217.240.32 04:47, 27 April 2007 (UTC)
A full, 84.6GB (compressed) dump has finally completed [2] (PLEASE, whoever runs the dumps, keep this one around at least until a newer version is complete, such that there's always at least one good, full dump). Regarding research, it's been suggested to dump random subsets in the requests for dumps on the meta site, but nothing's happened regarding it as of yet AFAICT.-- 81.86.106.14 22:38, 8 May 2007 (UTC)
Over at Wikipedia:Improving referencing efforts we're trying to improve article referencing. It would be helpful if we had statistics on the proportion of articles with references, average references per article, etc. Does anyone have a bot that could help with this? Thanks. - Peregrine Fisher ( talk) ( contribs) 21:49, 15 December 2008 (UTC)
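A bot would not even need the live site for this; the statistics can be computed by streaming a pages-articles dump and counting `<ref>` openings per page. A minimal sketch, assuming a bz2-compressed dump and counting both `<ref>...</ref>` and self-closing `<ref name=.../>` forms (the dump path is a placeholder):

```python
import bz2
import re

# Matches "<ref>" and "<ref name=...>" but not "<references/>".
REF_RE = re.compile(r"<ref[\s>/]")

def count_refs(wikitext):
    """Count <ref> openings (both <ref>...</ref> and <ref name=.../>)."""
    return len(REF_RE.findall(wikitext))

def reference_stats(dump_path):
    """Stream a pages-articles dump and tally per-page reference counts.
    Returns (total pages, pages with >=1 ref, total refs)."""
    pages = cited = total_refs = 0
    buf = []
    with bz2.open(dump_path, "rt", encoding="utf-8") as src:
        for line in src:
            buf.append(line)
            if "</page>" in line:
                n = count_refs("".join(buf))
                pages += 1
                total_refs += n
                cited += 1 if n else 0
                buf = []
    return pages, cited, total_refs

# pages, cited, refs = reference_stats("enwiki-pages-articles.xml.bz2")
# print(cited / pages, refs / pages)  # proportion cited, refs per article
```

This counts footnote tags only; citation templates outside `<ref>` tags would need their own pattern.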
Can we get access to the wikipedia search queries? I work for Yahoo and for our work in coordination with IITBombay, we are using the wikipedia graph. Please let me know at manish_gupta12003 [at] yahoo.com if we can have an access to a small subset of such queries (without any session info or user info). Thanks! —Preceding unsigned comment added by 216.145.54.7 ( talk) 07:36, 28 April 2009 (UTC)
I'm finding articles missing from the dumps.
Let's take a look at the first article in the dump:
AmericanSamoa (id 6)
#REDIRECT [[American Samoa]]{{R from CamelCase}}
Looking for redirect (American Samoa) in the dump:
grep -n ">American Samoa" enwiki-20090610-pages-articles.xml
72616: <title>American Samoa/Military</title>
6006367: <title>American Samoa/Economy</title>
7298422: <title>American Samoa/People</title>
7849801: <title>American Samoa/Geography</title>
7849816: <title>American Samoa/Government</title>
7849831: <title>American Samoa/Communications</title>
7849846: <title>American Samoa/Transportation</title>
44729785: <title>American Samoa at the 2004 Summer Olympics</title>
46998496: <title>American Samoan</title>
47801378: <title>American Samoa national football team</title>
65053327: <title>American Samoa territory, United States</title>
69169386: <title>American Samoa at the 2000 Summer Olympics</title>
74281114: <title>American Samoa Community College</title>
75011596: <text xml:space="preserve">American Samoa<noinclude>
79006314: <title>American Samoa national rugby league team</title>
97301557: <title>American Samoa National Park</title>
108306703: <title>American Samoa Territory Constitution</title>
119786420: <title>American Samoa at the 1996 Summer Olympics</title>
120802909: <title>American Samoa Fono</title>
120802959: <title>American Samoa House of Representatives</title>
120803207: <title>American Samoa Senate</title>
124460055: <title>American Samoa's At-large congressional district</title>
129254992: <title>American Samoan territorial legislature</title>
138027379: <title>American Samoa Football Association</title>
Can anyone shed any light on this? -- Smremde ( talk) 12:01, 25 June 2009 (UTC)
mysql> select page_id, page_title from page where page_namespace = 0 and page_title LIKE 'American_Samoa%' ORDER by 1 ASC;
+----------+-------------------------------------------------------------+
| page_id  | page_title                                                  |
+----------+-------------------------------------------------------------+
| 1116     | American_Samoa/Military                                     |
| 57313    | American_Samoa/Economy                                      |
| 74035    | American_Samoa/People                                       |
| 82564    | American_Samoa/Geography                                    |
| 82565    | American_Samoa/Government                                   |
| 82566    | American_Samoa/Communications                               |
| 82567    | American_Samoa/Transportation                               |
| 924548   | American_Samoa_at_the_2004_Summer_Olympics                  |
| 997978   | American_Samoan                                             |
| 1024704  | American_Samoa_national_football_team                       |
| 1687185  | American_Samoa_territory,_United_States                     |
| 1843552  | American_Samoa_at_the_2000_Summer_Olympics                  |
| 2058267  | American_Samoa_Community_College                            |
| 2262625  | American_Samoa_national_rugby_league_team                   |
| 3168052  | American_Samoa_National_Park                                |
| 3754535  | American_Samoa_Territory_Constitution                       |
| 4442190  | American_Samoa_at_the_1996_Summer_Olympics                  |
| 4504211  | American_Samoa_Fono                                         |
| 4504216  | American_Samoa_House_of_Representatives                     |
| 4504225  | American_Samoa_Senate                                       |
| 4738483  | American_Samoa's_At-large_congressional_district            |
| 5072984  | American_Samoan_territorial_legislature                     |
| 5656610  | American_Samoa_Football_Association                         |
| 7499304  | American_Samoa_at_the_1988_Summer_Olympics                  |
| 7821424  | American_Samoa_women's_national_football_team               |
| 7855525  | American_Samoa_at_the_1994_Winter_Olympics                  |
| 7855530  | American_Samoa_at_the_1992_Summer_Olympics                  |
| 7873035  | American_Samoa_Territorial_Police                           |
| 8213414  | American_Samoa_Department_of_Education                      |
| 8233086  | American_Samoan_general_election,_2008                      |
| 9682159  | American_Samoa_Constitution                                 |
| 10158999 | American_Samoa_national_rugby_union_team                    |
| 11944957 | American_Samoan_general_election,_2004                      |
| 11944974 | American_Samoan_legislative_election,_2006                  |
| 11944988 | American_Samoan_general_election,_2006                      |
| 12095118 | American_Samoa_national_soccer_team                         |
| 12225196 | American_Samoa_national_football_team_results               |
| 12869869 | American_Samoa_at_the_2007_World_Championships_in_Athletics |
| 12999445 | American_Samoa_national_basketball_team                     |
| 14938467 | American_Samoa_at_the_Olympics                              |
| 15587944 | American_Samoa's_results_and_fixtures                       |
| 15611124 | American_Samoa_Democratic_caucuses,_2008                    |
| 15641610 | American_Samoa_Republican_caucuses,_2008                    |
| 15681703 | American_Samoa_Republican_caucuses                          |
| 15929695 | American_Samoa_Territory_Supreme_Court                      |
| 16089779 | American_Samoa_at_the_2008_Summer_Olympics                  |
| 19468499 | American_Samoa_Governor                                     |
| 19785703 | American_Samoa_Supreme_Court                                |
| 19850654 | American_Samoa_gubernatorial_election,_2008                 |
| 19913142 | American_Samoa_Power_Authority                              |
| 20611195 | American_Samoa                                              |
+----------+-------------------------------------------------------------+
51 rows in set (0.00 sec)
I was trying to download the articles' abstracts of the English Wikipedia and just realized that the files are just a few KB big and only contain a couple of abstracts. I checked older versions of the dump and dumps in other languages, and the situation is the same. Does anyone know how to fix the dump of the articles' abstracts, or know any other way to get those abstracts? —Preceding unsigned comment added by 141.45.202.255 ( talk) 10:07, 15 September 2009 (UTC)
Possibly a dumb question, but which enwiki dumps contain redirects (i.e. entry including title & text). It would seem my "enwiki-20090902-pages-articles" doesn't have them? Thanks Rjwilmsi 23:13, 20 January 2010 (UTC)
- I've installed the MediaWiki script
- downloaded and installed all tables with the exclusion of OLD_TABLES...
MediaWiki doesn't work! Do I absolutely need the old tables or not?
Thanks
Davide
—Preceding unsigned comment added by 80.116.41.176 ( talk) 19:27, 1 July 2005 (UTC)
Recent mailing list posts [3] [4] [5] indicate developers have been busy with the MediaWiki 1.5 update recently, but that smaller, more specific dumps can now more easily be created. I'm hoping we'll see an article-namespace-only dump. The problem with talk pages is that the storage space required for them will grow without bound, whereas we can actually try to reduce the storage space needed for articles - or at least slow growth - by merging redundant articles, solving content problems, etc. Such a dump would also make analysis tools that look only at encyclopedic content (and there are a growing number of useful reports - see Wikipedia:Offline reports and Template:Active Wiki Fixup Projects) run faster and not take up ridiculously large amounts of hard drive space (which makes it more difficult to find computers capable of producing them).
User:Radiant! has been asking me about getting more frequent updates of category-related reports for a "Categorization Recent Changes Patrol" project. This is difficult to do without more frequent database dumps, and I'm sure there are a number of other reports that could benefit from being produced more frequently, or at least containing more up to date information. (And less human editor time would be wasted as a result, we hope.)
I'm actually surprised that the database dumps must be produced manually; I'm sure there's an engineering solution that could automate the process, reducing the burden on developers. I hope the developers will be able to deal with performance and other issues to be able to do this in the near future, so we can stop nagging them about database dump updates and get on with the work of fixing up the wiki. -- Beland 03:48, 8 July 2005 (UTC)
Check WikiFilter.
It works with all current wiki project dump files in all languages. You do not need PHP or Mysql. All you need is a web server like Apache, and then you can view a wiki page through your web browser, either in normal html format, or in the raw wikitext format.
Rebuilding an index database is also reasonably fast. For example, the 3-GB English Wikipedia takes about 10 minutes on a Pentium 4. Wanchun ( Talk) 06:35, 20 September 2005 (UTC)
Is there a good, approved way to download a relatively small number of images (say ~4000)? The tar download is kinda slow for this purpose. It'd be faster to web crawl (even with the 1-second delay) than to deal with the tar. But that seems disfavoured.
Hi, I recently created a wiki for sharing code snippets and I've started to get some spam bots adding nonsense or spam to the articles. Rather than block them one at a time, I was thinking it'd be easier to preemptively use Wikipedia's IP block list and upload it into my wiki.
So, my question is, how do I download the block list -- and -- how do I upload it to my wiki? Thanks! - 206.160.140.2 13:49, 29 June 2006 (UTC)
If you are looking for an offline copy of wikipedia, I refer you to kiwix and a howto. Wizzy… ☎ 08:55, 27 April 2010 (UTC)
Why is the enwiki dump completed in April 2010:
20100312 # 2010-04-16 08:46:23 done All pages with complete edit history (.7z) Everything is Ok * pages-meta-history.xml.7z 15.8 GB
half the size of the previous dump, completed in March 2010?
20100130 # 2010-03-26 00:57:06 done All pages with complete edit history (.7z) Everything is Ok * pages-meta-history.xml.7z 31.9 GB
-- Dc987 ( talk) 23:09, 29 April 2010 (UTC)
Is there somewhere where I can download some of the templates that are in use here on wikipedia?
—Preceding unsigned comment added by 74.135.40.211 ( talk)
here
Wowlookitsjoe 18:40, 25 August 2007 (UTC)
I am looking to download Wiki templates as well. What else will I need to work with it? FrontPage? Dreamweaver? —Preceding unsigned comment added by 84.59.134.169 ( talk) 19:31, August 27, 2007 (UTC)
I am looking for a way to download just the templates too. I am trying to import articles from Wikipedia that use templates, and nested templates are preventing the articles from displaying correctly. Is there any way to get just the templates? —Preceding unsigned comment added by Coderoyal ( talk • contribs) 03:20, 15 June 2008 (UTC)
Yes agreed, even just a way to download the basic templates that you need to make an infobox and a few other things to get you started would be great.-- Kukamunga13 ( talk) 22:44, 8 January 2009 (UTC)
Yeah, I was searching for exactly what is described here. Could anyone help? Cheers
Found this discussion as well when trying to export templates - further googling revealed there is a method using Special:Export and specifying the articles you want to export; these can then be imported with Special:Import or the included XML import utilities. -- James.yale ( talk) 08:53, 4 September 2009 (UTC)
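The Special:Export method above can be scripted. A hedged sketch: it POSTs a list of titles to Special:Export; the `templates` field is an assumption based on the export form's "Include templates" checkbox, and the User-Agent string is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def export_payload(titles):
    """Form data for Special:Export. 'templates' asks the server to also
    include transcluded templates; 'curonly' limits to current revisions."""
    return {"pages": "\n".join(titles), "templates": "1", "curonly": "1"}

def export_with_templates(titles):
    """Fetch export XML (articles plus their templates) for Special:Import."""
    data = urlencode(export_payload(titles)).encode("utf-8")
    req = Request("https://en.wikipedia.org/wiki/Special:Export",
                  data=data,
                  headers={"User-Agent":
                           "TemplateExport/0.1 (contact@example.org)"})
    with urlopen(req) as resp:
        return resp.read()

# xml_bytes = export_with_templates(["Paris", "Berlin"])  # network call
```

The returned XML can be fed straight into Special:Import or MediaWiki's importDump.php, which should resolve the nested-template problem described above.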
I, too, would LOVE a way to download templates without having to wade through some truly disparate garbage that some people (thankfully not all) call "documentation." This is the biggest flaw in Wikipedia - crappy documentation that doesn't tell newbies how to find much of ANYthing. Liquidfractal ( talk) 01:26, 18 August 2010 (UTC)
Hi all. I'm a bit worried about all this raw data we are writing every day, aka Wikipedia and wikis. Due to the closing of GeoCities by Yahoo!, I learnt about Archive Team (it's a must-see). I also knew about the Internet Archive. Webpages are fragile, hard disks can be destroyed in accidents, etc.
The Wikimedia Foundation offers us dumps for all the projects, and I'm not sure how many people download them all. Some days ago I wrote a small script (I can share it, GPL) which downloads all the backups (only the 7z files, about ~100GB; I think they are the most important). Are you interested in getting a copy of all the Wiki[mp]edia public data? I don't know of any public mirrors of Wikipedia (although there are many for Linux ISOs). There are no public dumps of Nupedia or its mailing lists, only what is available through the Internet Archive. [6] [7]. Today, that is history.
I wrote a blog post about this. I think that this is very important, we are writing all the human knowledge, and it needs to be saved in every continent, every country around the globe. Please, write your comments and suggestions. Regards. emijrp ( talk) 12:05, 16 August 2010 (UTC)
Would adding http://freecache.org to the database dump files save bandwidth? Perhaps this could be experimented on for files < 1Gb & > 5Mb - Alterego
Huge article, lots of information, but not a single hint as to the approximate size of the Wikipedia database... Can anyone help me find that out? -- Procrastinating@ talk2me 13:13, 25 December 2006 (UTC)
Some info about the growing ratio of uploaded images every month. Sizes are in MB. The largest bunch of images uploaded was in 2010-07: 207,946 images, +345 GB, 1.6 MB per image average. Regards. emijrp ( talk) 11:36, 16 August 2010 (UTC)
date     imagescount  totalsize       avgsizeperimage
2003-1   1            1.29717636      1.297176361084
2004-9   685          180.10230923    0.262923079163
2004-10  4171         607.82356930    0.145726101486
2004-11  3853         692.86077213    0.179823714543
2004-12  6413         1433.85031033   0.223584954050
2005-1   7754         1862.35064888   0.240179345999
2005-2   8500         2974.20539951   0.349906517590
2005-3   12635        3643.41866493   0.288359213687
2005-4   15358        5223.11017132   0.340090517731
2005-6   18633        6828.90809345   0.366495362714
2005-7   19968        8779.54927349   0.439680953200
2005-11  21977        9490.01213932   0.431815631766
2005-5   27811        10488.68217564  0.377141497092
2005-12  28540        10900.89676094  0.381951533320
2005-8   29787        12007.92203808  0.403126264413
2005-9   28484        13409.05960655  0.470757604499
2006-2   36208        14259.63428211  0.393825515966
2006-1   35650        14718.57859612  0.412863354730
2005-10  31436        16389.36539268  0.521356578212
2006-3   41763        18516.05062675  0.443360166338
2006-4   46645        21974.19114399  0.471094246843
2006-5   49399        28121.86408234  0.569280027578
2006-6   50893        28473.23626709  0.559472545676
2006-12  52577        29190.01748085  0.555186060080
2006-7   58384        30763.06628609  0.526909192349
2006-10  59611        32019.11058044  0.537134263482
2006-11  57070        32646.13846588  0.572036770035
2006-9   64991        35881.70751190  0.552102714405
2007-2   64787        37625.21335030  0.580752517485
2007-1   71989        38771.36503792  0.538573463139
2006-8   78263        48580.74944115  0.620737122793
2007-3   97556        51218.63132572  0.525017746994
2007-6   92656        60563.04164696  0.653633241743
2007-4   116127       69539.80562592  0.598825472336
2007-5   94416        69565.31412792  0.736795819860
2007-12  96256        72016.71782589  0.748179000020
2007-7   110470       72699.43968487  0.658092148863
2008-2   103295       73830.05222321  0.714749525371
2007-11  118250       78178.20839214  0.661126498031
2008-1   114507       80664.45367908  0.704449978421
2008-3   115732       84991.75799370  0.734384249764
2007-10  112011       85709.30096245  0.765186463494
2007-8   120324       87061.07968998  0.723555397842
2008-4   102342       93631.80365562  0.914891282715
2007-9   125487       95482.06631756  0.760892094939
2008-6   101048       95703.97241211  0.947113969718
2008-11  120943       110902.88121986 0.916984705356
2008-5   116491       112381.90718269 0.964726091996
2008-9   134362       114676.82158184 0.853491475133
2008-10  114676       116692.37883282 1.017583267927
2009-2   127162       119990.37194729 0.943602427984
2008-12  193181       120587.25548649 0.624219025093
2009-1   126947       121332.83420753 0.955775514250
2008-7   119541       122324.61021996 1.023285820095
2008-8   119471       124409.18394279 1.041333745786
2010-8   89776        164473.27284527 1.832040554773
2009-9   132803       176394.38845158 1.328240991932
2010-2   158509       182738.31155205 1.152857639327
2009-5   143389       182773.11001110 1.274666187860
2009-6   160070       186919.01145649 1.167732938442
2009-11  178118       188819.04353809 1.060078394874
2009-4   196699       202346.40691471 1.028710908112
2010-3   178439       206399.28073311 1.156693776210
2010-1   361253       216406.85650921 0.599045147055
2009-12  143797       217440.78340054 1.512137133602
2009-7   164040       230185.27826881 1.403226519561
2009-8   167079       250747.52404118 1.500772233741
2009-3   209453       260923.17462921 1.245736153835
2010-6   173658       270246.74141026 1.556200931775
2010-4   208518       297715.36278248 1.427768167652
2010-5   203093       297775.34260082 1.466201900611
2009-10  272818       329581.74736500 1.208064524207
2010-7   207946       345394.14518356 1.660979990880
Hi everyone! :-)
After reading the Download Wikipedia pages earlier today, a few ideas have popped into my mind for a compacted, stripped down version of Wikipedia that would (Hopefully) end up being small enough to fit on standard DVD-5 media, and might be a useful resource for communities and regions where Internet access is either unavailable or prohibitively expensive. The idea that I've currently got in mind is primarily based on the idea of storing everything in 6-bit encoding (Then BZIPped) which - Though leading to a loss of anything other than very basic formatting and line/paragraph breaks - Would still retain the text of the articles themselves, which is the most important part of Wikipedia's content! :-)
Anyhow, I'm prone to getting ideas in mind, and then never acting upon them or getting sidetracked by other things...So unless I ever get all of the necessary bits 'n' pieces already coded and ready for testing, I'm not even going to consider drinking 7.00GB of Wikipedia's bandwidth on a mere personal whim. That said - Before I give more thought to the ideas in mind - I just wanted to check a few things with more knowledgeable users:
Farewell for now, and many thanks in advance for any advice! >:-)
+++ DieselDragon +++ ( Talk) - 22 October 2010 CE = 23:44, 22 October 2010 (UTC)
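The 6-bit idea above can be made concrete: restrict the text to a 64-symbol alphabet and pack four symbols into every three bytes, a 25% reduction before bzip2 even runs. A hedged sketch of just the packing step (the 64-symbol alphabet below is an illustrative assumption; a real stripped-down encyclopedia would need escapes for everything outside it):

```python
import string

# Assumption: 52 letters + 10 digits + space + newline = exactly 64 symbols.
ALPHABET = string.ascii_letters + string.digits + " \n"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}

def pack6(text):
    """Pack text (restricted to ALPHABET) at 6 bits per symbol."""
    acc = bits = 0
    out = bytearray()
    for ch in text:
        acc = (acc << 6) | CODE[ch]   # append one 6-bit code
        bits += 6
        if bits >= 8:                 # a full byte is ready
            bits -= 8
            out.append((acc >> bits) & 0xFF)
    if bits:                          # flush any trailing partial byte
        out.append((acc << (8 - bits)) & 0xFF)
    return bytes(out)
```

Whether this beats plain bzip2 on 8-bit text is an open question: general-purpose compressors already exploit the low entropy that 6-bit packing removes, so the gain after compression may be small.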
Regards,
Rich
Farmbrough, 10:47, 17 November 2010 (UTC).
I note that it isn't currently possible to download a data dump due to server maintenance issues, however I did find a dump of a file called enwiki-20091017-pages-meta-current.xml.bz2 on BitTorrent via The Pirate Bay. Is this the same data as pages-articles.xml.bz2?
One further query: has the Wikimedia Foundation considered making its data dumps available via BitTorrent? If you did, it might remove some load from your servers. -- Cabalamat ( talk) 01:40, 14 December 2010 (UTC)
Hi all, a couple of days ago I wrote my problem in the LyricsWiki's Help Desk page but they didn't answer anything! I thought there is a relationship between Wikia projects & Wikipedia, so I decided to ask that here. This is my problem I wrote there:
"Hi, I'm lookin' for this wikia's dump, like any other wikia I went to "Statistics" page, and the Dump file (Which were in the bottom of the page in other Wikias) was not there!! Why there is not any dumps? Isn't it free in all Wiki projects? "
So, in [8] should be a dump of LyricsWiki, but there isn't! Why? Isn't it a free download? -- ♥MehranVB♥ ☻talk | ☺mail 16:14, 21 December 2010 (UTC)
It has been done for several hundred years to manage the huge size of encyclopedias. It would be more manageable for users to download, and maybe it would help in creating the image dumps too. Paragraph 2 states "The dump process has had difficulty scaling to cope with the size of the very largest wikis". Can you provide more info on what kind of difficulties were encountered? Why can't "traditional" splitting into A - Z help the dump process? 00:56, 3 February 2011 (UTC) — Preceding unsigned comment added by Luky83 ( talk • contribs)
This article explains that the full dumps of the english language wikipedia have been failing, but doesn't indicate whether or not the problem is going to be resolved. Can we expect to see a full extract any time soon, or even for this to become routine again? — Preceding unsigned comment added by Jdownie ( talk • contribs) 11:56, 22 April 2011 (UTC)
Does anyone else feel that, even though it says 'most' (and it's probably true that copyright infringements are more common among images), the statement is potentially confusing, as it seems to suggest copyright infringements are something unique to images? I was thinking of something like 'or believed'. And perhaps at the end: 'Remember some content may be copyright infringement (which should be deleted).' Nil Einne ( talk) 14:39, 16 January 2011 (UTC) Is there any way to easily download all of these images? I could write a quick script to look at the tags on each File: page and filter out the non-free images.-- RaptorHunter ( talk) 22:53, 24 April 2011 (UTC)
I have created a new torrent for the May 2011 database dump. This torrent resolves some of the problems with the old torrent. It's also a gigabyte smaller (I cut out all of the Wikipedia: namespace pages) -- RaptorHunter ( talk) 01:57, 29 May 2011 (UTC)
http://thepiratebay.org/torrent/6430796
I noticed everyone whining about downloads taking up precious Wikipedia bandwidth, which makes me wonder why the most popular and largest downloads are NOT available as torrents. Project Gutenberg has the option to download ebook CDs and DVDs via torrent. • Sbmeirow • Talk • 21:39, 21 December 2010 (UTC)
Is there a dump of Wikipedia articles without the brackets - clean text without the bracketed links and references like [[ ]] and <references>? Clean and plain text. I'm just not sure if I can find this information in the article. Thanks. 71.33.206.156 ( talk) 10:24, 5 March 2011 (UTC)
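No official plain-text dump exists as far as I know, but the markup can be stripped from pages-articles yourself. A very rough regex-based sketch; a real wikitext parser (libraries such as mwparserfromhell, with its `strip_code()` method) handles nesting properly, and this version knowingly misses nested templates and tables:

```python
import re

def strip_markup(wikitext):
    """Very rough wikitext-to-plain-text conversion (a sketch, not a parser)."""
    text = re.sub(r"<ref[^>]*/>", "", wikitext)                    # empty refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # ref bodies
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # flat templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # links -> label
    text = re.sub(r"'{2,}", "", text)                              # bold/italics
    return re.sub(r"<[^>]+>", "", text)                            # leftover tags
```

Run repeatedly, the template rule also peels nested `{{...}}` one layer per pass; for anything serious, use a proper parser.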
It appears that http://download.wikimedia.org/enwiki/20110620/enwiki-20110620-pages-articles.xml.bz2 file is 5.8GB, whereas http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-articles.xml.bz2 is 6.8GB. The big difference between these two files suggest a problem in the dump process. Any idea about what might have happened here? Thanks! — Preceding unsigned comment added by 65.119.214.2 ( talk) 20:28, 7 July 2011 (UTC)
Is there any chance of getting just the history pages and logs for the wiki? I am interested in researching editing patterns and behaviors and I don't need any article text (especially not every version of article text). Or is there a better place to get this information? — Bility ( talk) 20:06, 15 August 2011 (UTC)
I have created a Wikipedia Torrent so that other people can download the dump without wasting Wikipedia's bandwidth.
I will seed this for a while, but I really am hoping that Wikipedia could publish an official torrent itself every year containing validated articles (checked for vandalism and quality), and then seed that torrent indefinitely. This would reduce strain on the servers and make the process of obtaining a database dump much simpler and easier. It would also serve as a snapshot in time, so that users could browse the 2010 or 2011 Wikipedia.
Torrent Link — Preceding unsigned comment added by RaptorHunter ( talk • contribs) 02:58, 11 March 2011 (UTC)
I wrote a script to cut out all of the Wikipedia namespace from the database dump. This includes stuff like WP:AFD, WP:Village Pump, and hundreds of pages of archives that most people don't want. Removing it saves about 1.2 GiB, or 19%. I also cut out the File:, MediaWiki:, and Help: namespaces.
Tell me if you think those namespaces should be cut from my next torrent.-- RaptorHunter ( talk) 22:47, 24 April 2011 (UTC)
I have the latest dump of Wikipedia articles as an .xml file, in the hope of sorting it and using only articles on a few topics, then uploading them to a MySQL database as a supplement to an online dictionary.
Which text editor do people use for the 33 GB XML file (UltraEdit / Vim)? I only have 8 GB of RAM on this computer; I'm not sure whether it will load only the part I'm currently viewing, use a scratch disk, or just crash.
Also, the filename is enwiki-latest-pages-articles.xml.bz2.xml.bz2.xml - I used 7-Zip to expand the first level, but what would I use to uncompress the next? Isn't XML already uncompressed?
Thanks to anyone who is familiar with this. -- Adam00 ( talk) 16:37, 13 October 2011 (UTC)
The multistream dump used a signed 32-bit variable as the offset, so the offsets of articles past 2 GB overflow into negative numbers. Did anybody notice this? Jimreynold2nd ( talk) 05:32, 29 February 2012 (UTC)
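If so, the damaged values can be reinterpreted after the fact: masking with 0xFFFFFFFF undoes the signed reading, and, assuming the offsets never decrease within the index file, wraps past 4 GiB can be recovered as well. A sketch (`unwrap_offsets` is an illustrative name, not part of any dump tool):

```python
def unwrap_offsets(raw_offsets):
    """Recover true byte offsets from a signed-32-bit-damaged index.

    Masking with 0xFFFFFFFF undoes the signed interpretation; assuming
    offsets are non-decreasing in the file, each wrap adds another 4 GiB."""
    fixed, base, prev = [], 0, 0
    for off in raw_offsets:
        off &= 0xFFFFFFFF           # e.g. -2147483648 -> 2147483648
        if off < prev:              # wrapped past the 32-bit boundary
            base += 1 << 32
        prev = off
        fixed.append(base + off)
    return fixed
```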
Hello, I use BzReader to read Wikipedia offline. The problem is that the English dump is very big (7.5 GB) and BzReader doesn't succeed in creating the index. On the site we can download the dump in parts, but as there are many parts (28), searching becomes very slow. I'm looking for the name of the software they use to split the dump, so I can split it into just two parts. Thank you, and sorry for my English. Rabah201130 ( talk) 10:52, 21 May 2012 (UTC)
Right now, I always get a file: collection.pdf I want the name to be something like: wiki-'title'.pdf — Preceding unsigned comment added by 96.255.124.207 ( talk) 23:53, 11 June 2012 (UTC)
The database download methods described here allow us to read a copy of the wiki offline. Is there a way to access an offline copy of the wiki that not only can be read as if it were the real thing, but also edited as if it were the real thing? I'm not asking for a distributed revision control system, in which the wiki would automatically update as individual users make their own edits. Rather, I'm wondering if it's possible to have an editable offline copy of the wiki, that someone could edit, only being able to modify the real wiki if manually copying changes from the offline version to the online one, article by article. 201.52.88.3 ( talk) 02:50, 16 June 2012 (UTC)
Hi. I notice that the monthly dump of enwiki "Articles, templates, media/file descriptions, and primary meta-pages" (e.g. as found at dumps)
Thanks in advance for your assistance. GFHandel ♬ 21:41, 24 April 2012 (UTC)
I'd like to have a dump of wikipedia that is:
Yes, I know about the WikiReader, but I don't want to be stuck reading *only* Wikipedia on the device. :)
Is that a realistic expectation, or should I just get a normal ebook reader with wifi capability and read Wikipedia from there? And if so, which one has satisfactory rendering? -- TheAnarcat ( talk) 14:32, 26 July 2012 (UTC)
Is there any way to break out and download just the science articles? Just math, science, history, and biographies of those involved? I know the topic detection wouldn't be perfect, but I want a lighter-weight dump without all the pop culture, sports, music, etc. Has anyone already done this? 24.23.123.92 ( talk) 21:14, 27 July 2012 (UTC)
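One rough approach is to stream the dump and keep only pages whose [[Category:...]] tags match a list of topic words. A heuristic sketch (the word list and function names are illustrative guesses, not a curated topic model):

```python
import bz2
import re

TOPIC_CATEGORIES = re.compile(
    r'\[\[Category:[^\]]*(science|mathematic|physic|histor|biolog)',
    re.IGNORECASE)

def keep_page(page_xml):
    """Heuristic: keep a page if any of its categories match a topic word."""
    return bool(TOPIC_CATEGORIES.search(page_xml))

def extract_topic_pages(src, dst):
    """Copy matching <page> elements from one bz2 dump to another."""
    with bz2.open(src, 'rt', encoding='utf-8') as fin, \
         bz2.open(dst, 'wt', encoding='utf-8') as fout:
        page = None
        for line in fin:
            if '<page>' in line:
                page = [line]
            elif page is not None:
                page.append(line)
                if '</page>' in line:
                    text = ''.join(page)
                    if keep_page(text):
                        fout.write(text)
                    page = None
```

The false-positive/false-negative rate depends entirely on the word list; following category links recursively from a few root categories would be more accurate but also much more work.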
I'd like to just get article namespace as I have download usage limits. Regards, Sun Creator( talk) 00:03, 27 September 2012 (UTC)
Is it possible to make a torrent with all of Wikipedia (images and articles)? Maybe in torrent parts.
How many TB would that be? -- CortexA9 ( talk) 15:56, 1 November 2012 (UTC)
This is a very useful resource, but are there some versions missing? I am especially looking for the http://fiu-vro.wikipedia.org (Võru) one (boasting 5090 articles), but with this one missing there may be others as well. Any clue, anyone? Trondtr ( talk) 16:28, 28 January 2013 (UTC).
All links to image dumps in this article are bad, and none of the bit torrents are working. As of now there is no way to download the images other than scraping the article pages. This will choke wiki bandwidth, but it looks like people have no other choice. I think sysops should take a look at this, unless of course it is intentional... is it? It's still not working; please have a look. — Preceding unsigned comment added by 116.73.4.171 ( talk) 13:49, 9 March 2013 (UTC)
I often use Wikipedia as a dictionary, and if I work online I can find an article in the English Wikipedia and then use the link on the left side to jump to another language. How can I do this offline? For example, if I use WikiTaxi I can only load one database, and it's not possible to jump to another language version of the current article.
I think it would be very interesting to explain what all the files add up to presently, and how it has changed over time. Specifically, I am thinking that Moore's law type effects on computer memory probably exceed the growth of the database, so that at some predictable future time it will be easy for most people to have the entire database in a personal device. Could you start a section providing these statistics for us, just as a matter of general interest? Thanks. Wnt ( talk) 17:03, 18 March 2013 (UTC)