科技回声

How to download all of Wikipedia onto a USB flash drive

448 points by bubblehack3r, over 2 years ago

52 comments

nneonneo, over 2 years ago
Circa 2009 or so, my absolute favorite app for the iPod Touch was Patrick Collison's Offline Wikipedia (yes, that Patrick Collison: https://web.archive.org/web/20100419194443/http://collison.ie/wikipedia-iphone/). You could download various wikis that had been pre-processed to fit in a very small space - as I recall, the entire English Wikipedia was a mere 2 GB in size. It was simply magical that I could have access to all of Wikipedia anytime, anywhere offline - especially since the iPod Touch could only connect to the Internet via WiFi. It was particularly useful while travelling, since I could load up articles and just read them on the plane.

As I recall, there were several clever things that the app did to reduce the size of the dump: many stub/redirect articles were removed, the formatting was pared down to the bare minimum, and it was all compressed quite efficiently to fit in such a small space. Patrick gives more technical detail on an earlier version of the app's homepage: https://web.archive.org/web/20080523222440/http://collison.ie/wikipedia-iphone/
armagon, over 2 years ago
FYI, the Internet Archive hosts a ZIM archive that has dumps of Wikipedia and many other works: https://archive.org/details/zimarchive

I wish it was a little more obvious how to search it, or what all the variations mean, but it looks like a valuable resource.

It is worth noting that Kiwix works on multiple OSes and on phones, and has a WiFi hotspot version (that you might run on a Raspberry Pi, for example). Internet-in-a-Box similarly works as a WiFi hotspot for ZIM archives.

Lastly, it is worth mentioning that there are tools for creating your own ZIM files; it looks like the most straightforward way is to take a static website and use a utility to convert it into one self-contained file.
bombcar, over 2 years ago
Kiwix is great - I have a collection of various things from their library (https://library.kiwix.org/?lang=eng) downloaded for when I'm on a plane or the internet is otherwise unavailable.

That and the TeX Live PDF manuals can get me through anything.
ernst_mulder, over 2 years ago
Some time ago I dreamt that I was in an alien spaceship for some reason, still carrying my phone and laptop bag. They were a friendly lot and asked whether or not I would like to charge my laptop. "Do you have 220V sockets?" I asked. They didn't know what that was. So I needed measurements and definitions. An approximate meter, an approximate second. Coulomb was difficult. I woke up and downloaded Wikipedia the next day. Deleted it again later for lack of hard disk space...

But next time this happens I will have a USB stick with all the necessary knowledge. The definitions for voltage, current and frequency should, however, be printed out in case my laptop battery charge is insufficient for accessing the USB stick.
thakoppno, over 2 years ago
Somewhere around the original iPad era, I believe there was a curated subset of Wikipedia articles that may have been called something like Educator’s Edition.

It worked offline and had images, and I traveled to Peru with it and learned so much. Does anyone remember this sort of thing?

I’ve tried wix formatted copies and they do work, but the experience on an offline iPad was simply better. Thanks in advance.
r3trohack3r, over 2 years ago
Kiwix is an amazing project.

I used a similar approach for https://wikiscroll.blankenship.io

1. kiwix dump
2. unpack to HTML
3. process with cheerio to create JSON files
4. create a git repo and push to GitHub Pages

Works well for infinitely scrolling content; it's just Math.random on top of static files.

https://github.com/retrohacker/wikiscroll
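Editor's sketch of steps 2 to 4 in the comment above, for readers who want to see the shape of such a pipeline. This is not the wikiscroll code: the original uses cheerio in JavaScript, while this sketch swaps in Python's standard-library `html.parser`. The names `ArticleExtractor`, `html_to_record`, and `pick_random` are hypothetical, and the record layout (title plus list of paragraphs) is an assumption.

```python
import json
import random
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buffer.append(data)


def html_to_record(html_text):
    """Turn one unpacked HTML page into a JSON-serialisable dict."""
    parser = ArticleExtractor()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), "paragraphs": parser.paragraphs}


def pick_random(records, rng=random):
    """The 'Math.random on top of static files' step: choose any record."""
    return rng.choice(records)


if __name__ == "__main__":
    sample = ("<html><head><title>Kiwix</title></head>"
              "<body><p>Kiwix is an <b>offline</b> reader.</p></body></html>")
    print(json.dumps(html_to_record(sample)))
```

In a real pipeline each record would be written to its own static JSON file, and the page would fetch a randomly chosen file as the reader scrolls.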
orliesaurus, over 2 years ago
Oh wow, I thought this was gonna be a REALLY large file, but it's only 95 GB. Not bad; some worthless video games are larger, haha.
stewbrew, over 2 years ago
How can someone use so many words to say "use Kiwix"?
jabbany, over 2 years ago
I recall doing such an offline dump with WikiTaxi (https://www.yunqa.de/delphi/apps/wikitaxi/index) back when WP was getting banned in China a decade or so ago.

IIRC the articles were rather easy to download and convert, even on my early 2000s netbook. The media (pictures, video, audio), though, was painful to deal with, and it didn't take long to find out that Wikipedia without diagrams and figures was not a great experience.
jokoon, over 2 years ago
So can it remove things like movies and TV shows and other noise?

I remember there was some work done to categorize articles like with the Dewey system, but so far, you can't really reduce the size of those exports.

Of course it would require a lot of work. Maybe it's already possible to categorize articles if they belong to a "portal".

But yeah, it doesn't seem the Wikipedia foundation really cares about those kinds of problems. To be fair, they lack money.
sqrt_1, over 2 years ago
The article says to format the drive as exFAT because NTFS has a 4 GB limit - I don't think that is true.
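Editor's note on the point above: the 4 GB per-file ceiling belongs to FAT32, which records a file's size in a 32-bit field; NTFS and exFAT both accept far larger files. A minimal arithmetic check, using the roughly 95 GB dump size quoted in the thread (the helper name `fits_on_fat32` is hypothetical):

```python
# FAT32 stores the file size in a 32-bit field, so the largest
# possible single file is 2**32 - 1 bytes (4 GiB minus one byte).
FAT32_MAX_FILE_BYTES = 2**32 - 1


def fits_on_fat32(size_bytes):
    """Return True if a single file of this size can exist on FAT32."""
    return size_bytes <= FAT32_MAX_FILE_BYTES


dump_bytes = 95 * 10**9  # the ~95 GB figure quoted from the article
print(fits_on_fat32(dump_bytes))  # prints False
```

So the practical advice stands for a drive that ships formatted as FAT32: reformat it before copying the dump, whether to exFAT, NTFS, or another file system.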
yieldcrv, over 2 years ago
protip: you need to download Wikipedia in other languages as well

they are not translations, they are completely different articles under the name brand and platform of Wikipedia

an entry that may be just a blurb in English may be one of the most comprehensive and fully fleshed out and researched entries on the site in German, for example
pupppet, over 2 years ago
Can anyone recommend a hardy device for viewing the content? As nutty as it sounds, in some post-apocalyptic world it would sure be nice to have. I'd keep it under the bed just in case...
barbs, over 2 years ago
Is there a portable version of Kiwix? Would be cool if you could plug the USB into any computer and start reading Wikipedia without having to install anything.
jamylak, over 2 years ago
I recently discovered wiktextract (https://github.com/tatuylonen/wiktextract) for Wiktionary, and kaikki (https://kaikki.org/) has the extracts available as links, but it's only the English Wiktionary for now.
peter_d_sherman, over 2 years ago
>"After reading this article, you’ll be able to save all ~6 million pages of Wikipedia so you can access the sum of human knowledge regardless of internet connection!"

[...]

>"The current Wikipedia file dump in English is around 95 GB in size. This means you’ll need something like a 128 GB flash drive to accommodate the large file size."

Great article!

Also, on a related note, there's an interesting philosophical question related to this:

Given the task of preserving the most important human knowledge from the Internet and given a certain limited amount of computer storage -- what specific content (which could include text, pictures, web pages, PDFs, videos, technical drawings, etc.) from what sources do you select, and why?

So first with 100GB (All of Wikipedia is a great choice, btw!) -- but then with only 10GB, then 1GB, then 100MB, then 10MB, then 1MB, etc. -- all the way down to 64K! (about what an early microcomputer could hold on a floppy disk...)

What information do you select for each storage amount, and why?

(Perhaps I should make this a future interview question at my future company!)

Anyway, great article!
smukherjee19, over 2 years ago
Wow, this is so cool! 95 GB and I can browse the entire Wikipedia offline!? Thanks so much!

https://library.kiwix.org/?lang=eng

I was looking at what other sites are available, and it seems there are quite a few. Are there any specific ones apart from Wikipedia that HN readers would recommend?
sjducb, over 2 years ago
There is a ZIM file that contains all of Stack Overflow. Super useful if you have to program without access to the internet.
icod, over 2 years ago
But, but... if all of Wikipedia fits on a USB drive, what do they need the millions and millions of dollars for? /s
squarefoot, over 2 years ago
How does this scale with the need to update data over time, with corrections etc.? Having to download everything again doesn't seem that elegant. I think this would benefit a lot from some form of incremental backup support, that is, downloading only what has changed since last time. A possible implementation could be a BitTorrent-distributed, git-like mirror, so that everyone could maintain their own synced local copy and create a snapshot of it on removable media on the fly.
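Editor's sketch of the "download only what changed" idea from the comment above. It is not a Kiwix feature and says nothing about how ZIM files are laid out; it only assumes the mirror publishes a list of per-chunk SHA-256 digests that a client can compare with its own. The function names and the fixed chunk size are hypothetical.

```python
import hashlib


def chunk_digests(data, chunk_size):
    """Split data into fixed-size chunks and hash each one."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old_digests, new_digests):
    """Indexes of chunks that are new or differ from the local copy."""
    return [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or old_digests[i] != digest
    ]


if __name__ == "__main__":
    local = chunk_digests(b"aaaabbbbcccc", 4)
    remote = chunk_digests(b"aaaaXbbbccccdd", 4)
    print(changed_chunks(local, remote))  # prints [1, 3]
```

Fixed-size chunks only work well when edits do not shift later bytes; an insertion near the start changes every following chunk. Tools such as rsync use rolling checksums to cope with that, which is beyond this sketch.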
e-clinton, over 2 years ago
Do you think Apple would approve an app that just offlines Wikipedia?
bArray, over 2 years ago
I think there are better ways to open ZIM files. I've had massive trouble with Kiwix. The old version seems broken beyond repair and the new version is too heavy.

ZIMply on branch `version2` has worked pretty well for me [1]. The search works a lot better and it's really nicely formatted.

[1] https://github.com/kimbauters/ZIMply/tree/version2
rbistolfi, over 2 years ago
There is also CDPedia, a project from Python Argentina originally intended for making Wikipedia available in rural schools without an Internet connection. https://github.com/PyAr/CDPedia http://cdpedia.python.org.ar/index.en.es.html
bori5, over 2 years ago
Apropos of nothing, I stumbled upon Encyclopedia Britannica the other day. Does anyone know what’s up with that, and whether there are any pros to it vs Wikipedia?
sixhobbits, over 2 years ago
The Library page has three identical-looking entries (100 GB, 50 GB, and 15 GB) without any explanation of what is or isn't included in each.
CGamesPlay, over 2 years ago
Can anyone explain to me how the Kiwix library site works? There are 3 Wikipedia listings that all have the same name, description, language, and author, but seem to have different content. This pattern repeats for the “Wikipedia 0.8” and “Wikipedia 100” sets. One of the latter says that the top 100 pages on Wikipedia require 889 MB? What’s going on here?
londons_explore, over 2 years ago
Note that it's possible to make Wikipedia substantially smaller if you're happy to use more aggressive compression algorithms.

Kiwix divides the data into chunks and adds various indexes and stuff to allow searching data and fast access, even on slow CPU devices. But if you can live with slow loading, you can probably halve the storage space required, or maybe more.
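Editor's sketch of the trade-off described above: compressing many small pieces independently allows fast random access, but each piece loses the redundancy it shares with the others. The example uses `zlib` from the Python standard library as a stand-in and synthetic "articles"; it does not reproduce the ZIM format or the compressor Kiwix uses, and `compressed_sizes` is a hypothetical helper.

```python
import zlib


def compressed_sizes(articles):
    """Return (total size when each article is compressed on its own,
    size when all articles are compressed as one stream)."""
    per_article = sum(len(zlib.compress(a, 9)) for a in articles)
    whole = len(zlib.compress(b"".join(articles), 9))
    return per_article, whole


# Synthetic stubs that share most of their wording, as boilerplate-heavy
# encyclopedia entries often do.
articles = [
    ("Article %d. This stub shares most of its wording with "
     "every other stub in the dump." % i).encode()
    for i in range(200)
]

per_article, whole = compressed_sizes(articles)
print("compressed separately:", per_article, "bytes")
print("compressed as one stream:", whole, "bytes")
```

The single stream comes out much smaller here, but reading one article from it means decompressing everything before it, which is the slow loading the comment mentions.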
krasi0, over 2 years ago
And if you're only interested in preserving just some Wiki pages, this browser extension with some automation on top will do the perfect job: https://github.com/gildas-lormeau/SingleFile No affiliation, just a happy user :)
kloch, over 2 years ago
I wonder if there is an offline backup of Wikipedia on the ISS? There should be. And on every manned space mission.
quickthrower2, over 2 years ago
I wonder if they snapshot Wikipedia for this, or if they stagger it per article to avoid very recent unreviewed edits getting into such a download (edits that would, say, disappear from the site if they were bad edits or vandalism).
jscipione, over 2 years ago
Do not store 96 GB of anything on exFAT; use ext4 or APFS or ZFS or some journaled file system. Does NTFS really have a 4 GB file size limit? Structures should match exFAT, so that part seems suspect to me.
dangrie158, over 2 years ago
In the old days :tm: I remember doing this as well with a 1 GB drive (and room to spare for some mobile apps).

It would be interesting to see a graph of easily available USB drive size vs. Wikipedia dump size.
breck, over 2 years ago
Love it! Imagine if USB Flash drive manufacturers just loaded up new drives with content like this. I mean, why not right? I think the physics means it would even be lighter ;)
kibwen, over 2 years ago
Now I'm curious: if, hypothetically, Wikipedia was just backed by a single git repo and every edit was a commit, how big would it be and how long would it take to clone?
PaulDavisThe1st, over 2 years ago
Can someone explain what the role of Kiwix is in all this, please?
ryanmercer, over 2 years ago
But why? If civilization collapses I'm not going to think "oh, let me consult Wikipedia", I'm going to think "man, this sucks".
SargeDebian, over 2 years ago
This has to be one of the most poorly structured pieces of writing I've seen in a while. It's way too verbose, and on the one hand there are separate sections like:

* Getting a flash drive
* Formatting a flash drive (which includes a subsection on not formatting it but buying one that's already formatted instead, while there was a separate section just before this one on buying a drive)
* Waiting for a file to download

At the same time, downloading Wikipedia and downloading Kiwix are in the same section. Then, installing Kiwix is in a section called "You're done", which isn't next to the section on downloading Kiwix.
milkshakes, over 2 years ago
I want to like Kiwix -- I downloaded Wikipedia AND Stack Overflow -- but it keeps crashing every time I try to search for anything on this M1 MacBook.
colordrops, over 2 years ago
Is there a way to keep a mirror that stays in sync?
blue1, over 2 years ago
Does it include the images, or is it just the text?
garfield322, over 2 years ago
This would be useful to drop into North Korea?
iamwil, over 2 years ago
I think I'd rather have Stack Overflow offline before I'd want Wikipedia offline, though.
_int3_, over 2 years ago
Is anyone making a ZIM of news.ycombinator.com? A once-a-week package would be fine. How would one make it?
haolez, over 2 years ago
Could I use something like this to train my own GPT that's obsessed with Wikipedia? :)
jhatemyjob, over 2 years ago
Is there something like this that downloads the full edit history as well?
felipelalli, over 2 years ago
What about the images?
mellowhype, over 2 years ago
It would be cool if Kiwix came with an auto-update feature, but given the database size, I believe it's difficult to implement.
porbelm, over 2 years ago
95 GB? I remember when it was like 2 GB haha
sprash, over 2 years ago
Is there something similar for Stack Overflow?
gbraad, over 2 years ago
Still using a WikiReader?
yCloser, over 2 years ago
And now donate to Wikipedia, because you just caused them to pay for 95 GB of (useless) traffic.
mikotodomo, over 2 years ago
Cool. I don't have a USB flash drive though.