Download the Entire Wikimedia Database

151 points by surround, about 4 years ago

13 comments

dwheeler, about 4 years ago
I'm so glad the download-entire-wikipedia function continues to exist. That will help counter the "lost the entire library problem" from the city of Alexandria. To be fair, Wikipedia only has summaries, not the detailed material, but it's still important.
nayuki, about 4 years ago
Back in 2014 I computed the PageRanks within English Wikipedia, thanks to their database dump. https://www.nayuki.io/page/computing-wikipedias-internal-pageranks
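For readers unfamiliar with the idea, here is a minimal sketch of PageRank by power iteration on a toy link graph. It is not nayuki's implementation: the function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-page graph are all assumptions chosen for illustration, and a real run over a Wikipedia dump would first need to extract the link table.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping a page name to the list of pages it links to.
    Returns a dict mapping every page to its rank; ranks sum to 1.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank to each link target.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: four pages, where "C" is linked from everywhere.
toy_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Dictionaries keep the sketch readable; at the scale of a full Wikipedia link graph, integer page IDs and array-based storage would likely be needed instead.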
orblivion, about 4 years ago
You can also get it in a user-friendly format with the application Kiwix (https://www.kiwix.org/) if that's your use case. PC, phone, or server. You get subsets of the data, and images are smaller to save space.
ris, about 4 years ago
Genuine question: why is bittorrent not being used for this?
nromiun, about 4 years ago
I wish Wikipedia would offer incremental downloads (e.g. rsync). That would make it much easier to host your own Wikipedia.
dudus, about 4 years ago
Wikipedia is always bugging me about donations, and yet here is a feature they could charge for, or at least use to hint at donating. It would be perfectly acceptable to charge here, since abuse of this can rack up quite a bill. Maybe they don't pay as much as I do for outbound traffic on AWS, but still.
sleavey, about 4 years ago
Further down the page:

> Backup dumps of wikis which no longer exist [...] This includes, in particular, the Sept. 11 wiki.

There was a Sept. 11 wiki hosted by Wikimedia?
karlicoss, about 4 years ago
Wouldn't it be cool if Steam supported distributing an offline Wikipedia database? It's just a few gigs (depending on languages/images/etc., but it fits the DLC model perfectly), and it already uses BitTorrent.
guerby, about 4 years ago
Is the media content (images, videos) downloadable?

When I follow the links I find 2012-2013 data, but maybe I missed something?

https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
jokoon, about 4 years ago
There should be some form of compilation of quality articles per domain, like history, sciences, etc.

In a way all articles should belong to a category...
libraryofalex, about 4 years ago
One of the only things you can do to ensure lasting democracy today is to download the pages, with complete history, put them on a USB drive or microSD card properly labelled for you to keep offline, and just forget about it. You can do this as a consumer, it's easy. There's no harm in it, it's not some kind of private data such as personal photos or documents. If you end up forgetting or losing track of it, it really is no big deal. You just decided to download it when you saw it on Hacker News back in 2021, right?

My reason for saying this is one of the only things you can do to ensure lasting democracy is that it is physically possible that at some point, through some mechanism, the online version simply does not inform the public on some important public issue, whereas the history as you can download it today does. Though, I wouldn't speculate about what the mechanism might be or what kinds of subject.

At that point, in a physical sense, you could consult your offline copy on an airgapped PC or future equivalent, and I think it would be impossible for any group of any kind to even know you were doing that, let alone stop it.

How you might get the word out is another question, but having this personal capability is easy for the people here, as technical users and simple consumers. Indeed the whole Internet was set up as a distributed network in case of nuclear attack, so the entire topology of the Internet is set up for you to do this easily today.

It's a click and a cheap flash drive or slightly more expensive microSD card away. You can take this step in less than 20 active minutes of your time and for less than $50 if you go with an external spinning disk drive (such as 1 terabyte) or $200 or so if you go with a microSD card. It doesn't really matter if the file ultimately fails; this is not a critical backup for you to have, just a nice-to-have. You could write the file's checksum onto the drive in marker so you can tell whether it's still correct later (as opposed to having bit errors).

Maybe there is some file type that has a bit of redundancy (checksums) for long-term storage, since due to the large amount (several hundred gigabytes) I wouldn't be all that surprised if a few bits flipped over the course of several years in cold storage. But I don't know what kind of file type has any sort of redundancy or parity built into it that is supposed to protect against this. (Does anyone know?) Most likely the hash just wouldn't match what you wrote in pen on it, but it would still be usable.

Regarding the choice of spinning disk or microSD card: I guess it's physically possible that at some point people would have their personal property rummaged through by some group, and a hard drive is pretty obvious and could be stolen or removed for that reason. (In a physical sense, not speculating about social or political developments that might lead to that.)

So for this reason perhaps the best option would be to put it on a microSD card, even though it is quite a bit more expensive. I guess that, written once, bit rot causes microSD cards to decay within a few years if not used at all.[1] I don't know for spinning media, but I guess it's also about 5-10 years at least.[2]

You could put the microSD card under a postage stamp, for example, and put an important unrelated document into the envelope, which you would expect to keep for many years. Of course you could always end up accidentally discarding your envelope (while retaining its contents), but that risk shouldn't matter too much. In a physical sense it is possible for groups to x-ray all paperwork (such as envelopes, as I just suggested), and a microSD card's electrical contacts are pretty obvious in an x-ray. (It looks like this [3]). I don't have any suggestion that works against this attack, which is within the realm of what's possible according to the laws of physics.

I'm not speculating on what social or political developments might possibly make anything like this necessary at some point in the future, but we still live in a world governed by the laws of physics, so as technical professionals you have a huge leg up on most of the world. Spending $50 doing this today might save democracy tomorrow. You could also leave it as a time capsule; however, the storage longevity is not that long (between 5 and 20 years, I guess), and in a physical sense a time capsule is not particularly secure and would require instructions for someone else to figure out, so it's not great in that sense.

So in terms of what you can do today, I would suggest just getting an external 1 terabyte USB drive ($50), downloading the dump together with history (20 active minutes), writing the checksum onto it in marker, and just putting it somewhere. Obviously this small $50 investment is one you would hope never to have to use, but who knows, you might go down in history as the one who saved some small part of the world. Though, obviously, not in Wikipedia history.

[1] https://www.quora.com/What-is-the-longevity-of-a-sd-memory-card

[2] https://serverfault.com/questions/986911/how-long-will-unused-hard-drive-last

[3] https://www.reddit.com/r/pics/comments/3b6bjw/i_xrayed_an_sd_card/
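The checksum step in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the comment does not name a hash algorithm, so SHA-256 is my choice, and the helper names `sha256_of_file` and `matches_recorded` as well as the file name in the usage note are hypothetical. The file is read in chunks so that a dump of several hundred gigabytes does not have to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path, recorded_hex):
    """Compare a file's current digest with the one written down earlier."""
    return sha256_of_file(path) == recorded_hex.strip().lower()
```

A typical use would be to run `sha256_of_file("dump.xml.bz2")` once after downloading (the file name is a placeholder), write the result down, and call `matches_recorded` with that value years later. A mismatch only says that at least one bit changed; it does not locate or repair the damage, which is why the comment goes on to ask about formats with built-in parity.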
Santosh83, about 4 years ago
Are images/media from Wikimedia Commons included in these dumps?
bawolff, about 4 years ago
Well not the entire db, just the public parts. User passwords are not included ;)