Download the Entire Wikimedia Database

151 points by surround about 4 years ago

13 comments

dwheeler about 4 years ago
I'm so glad the download-entire-wikipedia function continues to exist. That will help counter the "lost the entire library problem" from the city of Alexandria. To be fair, Wikipedia only has summaries, not the detailed material, but it's still important.
nayuki about 4 years ago
Back in 2014 I computed the PageRanks within English Wikipedia, thanks to their database dump. https://www.nayuki.io/page/computing-wikipedias-internal-pageranks
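For readers unfamiliar with the idea, here is a minimal sketch of PageRank by power iteration over a link graph. It is a textbook formulation, not necessarily the method used in the linked write-up. The toy graph, the function name `pagerank`, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration; a real run would build the graph from the page-link tables in the dump.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}.

    Hypothetical helper for illustration; not taken from the linked article.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n
            for page in pages
        }
        # Every other page splits its rank equally among its targets.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy link graph (assumed data, not real Wikipedia links).
    toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    ranks = pagerank(toy_links)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, round(ranks[page], 4))
```

Dictionaries are fine at toy scale; for the millions of pages in English Wikipedia, compact integer arrays would likely be needed instead.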
orblivion about 4 years ago
You can also get it in a user-friendly format with the application Kiwix (https://www.kiwix.org/) if that's your use case. It runs on a PC, phone, or server. You get subsets of the data, and images are smaller to save space.
ris about 4 years ago
Genuine question: why is BitTorrent not being used for this?
nromiun about 4 years ago
I wish Wikipedia would offer incremental downloads (e.g. rsync). That would make it much easier to host your own Wikipedia.
dudus about 4 years ago
Wikipedia is always bugging me about donations, and yet here is a feature they could charge for, or at least use to prompt for donations. It would be perfectly acceptable to charge here, since abuse of this can rack up quite a bill. Maybe they don't pay as much as I do for outbound traffic on AWS, but still.
sleavey about 4 years ago
Further down the page:

> Backup dumps of wikis which no longer exist [...] This includes, in particular, the Sept. 11 wiki.

There was a Sept. 11 wiki hosted by Wikimedia?
karlicoss about 4 years ago
Wouldn't it be cool if Steam supported distributing an offline Wikipedia database? It's just a few gigs (depending on languages/images/etc., but it fits the DLC model perfectly), and it already uses BitTorrent.
guerby about 4 years ago
Is the media content (images, videos) downloadable?

When I follow the links I find 2012-2013 data, but maybe I missed something?

https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
jokoon about 4 years ago
There should be some form of compilation of quality articles per domain, like history, sciences, etc.

In a way all articles should belong to a category...
libraryofalex about 4 years ago
One of the only things you can do to ensure lasting democracy today is to download the pages, with complete history, put them on a properly labelled USB drive or microSD card that you keep offline, and just forget about it. You can do this as a consumer; it's easy. There's no harm in it, since it's not private data such as personal photos or documents. If you end up forgetting or losing track of it, it really is no big deal. You just decided to download it when you saw it on Hacker News back in 2021, right?

My reason for saying this is one of the only things you can do to ensure lasting democracy is that it is in the realm of what is physically possible that at some point, through some mechanism, the online version simply does not inform the public on some important public issue, whereas the history as you can download it today does. I wouldn't speculate about what the mechanism or the subject might be, though.

At that point you could consult your offline copy on an airgapped PC or future equivalent, and I think it would be impossible for any group of any kind to even know you were doing that, let alone stop it.

How you might get the word out is another question, but having this personal capability is easy for the people here, as technical users and simple consumers. Indeed, the entire Internet was set up as a distributed network in case of nuclear attack, so its whole topology is set up for you to do this easily today.

It's a click and a cheap flash drive or slightly more expensive microSD card away. You can take this step in less than 20 active minutes of your time and for less than $50 if you go with an external spinning disk drive (such as 1 terabyte), or $200 or so if you go with a microSD card. It doesn't really matter if the file ultimately fails; this is not a critical backup for you to have, just a nice-to-have. You could write the file's checksum onto the drive in marker so you can tell later whether it's still correct (as opposed to having bit errors).

Maybe there is some file type that has a bit of redundancy (checksums) for long-term storage, since with this much data (several hundred gigabytes) I wouldn't be all that surprised if a few bits flipped over the course of several years in cold storage. But I don't know what kind of file type has any sort of redundancy or parity built in to protect against this. (Does anyone know?) Most likely the hash just wouldn't match what you wrote in pen on it, but the file would still be usable.

Regarding the choice of spinning disk or microSD card: I guess it's in the realm of what's physically possible that at some point people would have their personal property rummaged through by some group, and a hard drive is pretty obvious and could be stolen or removed for that reason. (In a physical sense; I'm not speculating about social or political developments that might lead to that.)

So for this reason it would perhaps be best to put it on a microSD card, even though it is quite a bit more expensive. I guess that, once written, microSD cards decay from bit rot within a few years if not used at all.[1] I don't know about spinning media, but I guess it's also about 5-10 years at least.[2]

You could put the microSD card under a postage stamp, for example, and put an important unrelated document into the envelope, which you would expect to keep for many years. Of course you could always end up accidentally discarding your envelope (while retaining its contents), but that risk shouldn't matter too much. It is physically possible for groups to x-ray all paperwork (such as the envelopes I just suggested), and a microSD card's electrical contacts are pretty obvious in an x-ray. (It looks like this [3].) I don't have any suggestion that works against this attack, which is within the realm of what's possible according to the laws of physics.

I'm not speculating on what social or political developments might make anything like this necessary at some point in the future, but we still live in a world governed by the laws of physics, so as technical professionals you have a huge leg up on most of the world. Spending $50 doing this today might save democracy tomorrow. You could also leave it as a time capsule; however, the storage longevity is not that long (between 5 and 20 years, I guess), and a time capsule is not particularly secure and would require instructions for someone else to figure out, so it's not great in that sense.

So in terms of what you can do today, I would suggest just getting an external 1 terabyte USB drive ($50), downloading the dump together with history (20 active minutes), writing the checksum onto it in marker, and just putting it somewhere. Obviously this small $50 investment is one you would hope never to have to use, but who knows, you might go down in history as the one who saved some small part of the world. Though, obviously, not in Wikipedia history.

[1] https://www.quora.com/What-is-the-longevity-of-a-sd-memory-card

[2] https://serverfault.com/questions/986911/how-long-will-unused-hard-drive-last

[3] https://www.reddit.com/r/pics/comments/3b6bjw/i_xrayed_an_sd_card/
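The "write the checksum on the drive" step above could be sketched as follows. The comment does not name a hash algorithm, so SHA-256 here is an assumption, and the function names are hypothetical. The file is read in chunks so that a dump of several hundred gigabytes never has to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_checksum(path, recorded_hex):
    """Compare a file against the digest written down earlier."""
    # Tolerate stray whitespace or upper-case letters in the written copy.
    return sha256_of_file(path) == recorded_hex.strip().lower()
```

Years later, calling `matches_recorded_checksum("dump.xml.bz2", "<digest copied from the label>")` (file name hypothetical) would report whether any bits have flipped. A mismatch only detects damage; it does not locate or repair it, which is what the parity question in the comment is asking about.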
Santosh83 about 4 years ago
Are images/media from Wikimedia Commons included in these dumps?
bawolff about 4 years ago
Well, not the entire DB, just the public parts. User passwords are not included ;)