Very cool.<p>The take-any-webpage-offline need is also common in the education space (teachers want to save a webpage and send it to their students as part of a lesson, without worrying about availability, ads, etc.).<p>I used to work on tools for this <a href="https://github.com/learningequality/ricecooker/blob/develop/ricecooker/utils/downloader.py#L205-L502" rel="nofollow">https://github.com/learningequality/ricecooker/blob/develop/...</a> and <a href="https://github.com/learningequality/BasicCrawler/blob/master/basiccrawler/crawler.py#L286-L382" rel="nofollow">https://github.com/learningequality/BasicCrawler/blob/master...</a>
which worked quite well for most sites but were still very far from a general-purpose solution.<p>There is also a more powerful, general-purpose scraper that generates a ZIM file here: <a href="https://github.com/openzim/zimit" rel="nofollow">https://github.com/openzim/zimit</a><p>It would be really nice to have a "common" scraper code base that takes care of scraping (possibly with a real headless browser) and outputs all assets as files + info as JSON. This common code base could then be used by all kinds of programs to package the content as standalone HTML zip files, ePub, ZIM, or even PDF for crazy people like me who like to print things ;)
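To make the "assets as files + info as JSON" idea concrete, here's a minimal sketch assuming requests and BeautifulSoup and a static page (a real version would want a headless browser for JS-heavy sites; the names and file layout are just illustrative):

```python
import json
import pathlib
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def scrape_to_folder(url, out_dir="scraped"):
    """Download a page and its images, rewrite the <img> tags to
    point at the local copies, and write a JSON manifest of what
    was saved. Illustrative sketch only: no JS rendering, no CSS
    or script assets, no error handling."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

    assets = []
    for i, img in enumerate(soup.find_all("img", src=True)):
        src = urljoin(url, img["src"])
        name = f"asset{i}{pathlib.Path(urlparse(src).path).suffix}"
        (out / name).write_bytes(requests.get(src, timeout=30).content)
        img["src"] = name  # point the page at the local copy
        assets.append({"original_url": src, "file": name})

    (out / "index.html").write_text(str(soup), encoding="utf-8")
    (out / "manifest.json").write_text(json.dumps({
        "source_url": url,
        "title": soup.title.string if soup.title else None,
        "assets": assets,
    }, indent=2))


scrape_to_folder("https://example.com")
```

A packager (HTML zip, ePub, ZIM, PDF) could then consume the folder + manifest without caring how the scraping was done.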
I do a lot of this work[3] (web to documents) and it's interesting to see other approaches. The Medium image problem is something I've faced as well, but never got around to fixing. I'm planning to get a reMarkable soon, so will definitely be trying this out.<p>My personal solution has been <a href="https://github.com/captn3m0/url-to-epub/" rel="nofollow">https://github.com/captn3m0/url-to-epub/</a> (Node/readability), which I've tested against the entirety of Tor's original fiction collection[0], where it performs well enough (I'm biased). Another tool that does this beautifully is percollate[1], but it doesn't give the user enough control over the metadata - something I really care about.<p>I've also started to use rdrview[2], which is a C port of the current Firefox implementation of "reader view". It is very unix-y, so it is easy to pipe content to it (I usually run it through tidy first). Quite helpful in building web-archiving, web-to-pdf, or web-to-kindle pipelines.<p>[0]: <a href="https://www.tor.com/category/all-fiction/original-fiction/" rel="nofollow">https://www.tor.com/category/all-fiction/original-fiction/</a><p>[1]: <a href="https://github.com/danburzo/percollate" rel="nofollow">https://github.com/danburzo/percollate</a><p>[2]: <a href="https://github.com/eafer/rdrview" rel="nofollow">https://github.com/eafer/rdrview</a><p>[3]: <a href="https://captnemo.in/ebooks/" rel="nofollow">https://captnemo.in/ebooks/</a>
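For anyone curious what a tidy + rdrview pipeline looks like when driven from a script, here's a rough sketch in Python. It assumes curl, tidy, and rdrview are on PATH, and the rdrview flags (-u for the base URL, -H for HTML output) are the ones I remember, so verify against the man page:

```python
import subprocess


def url_to_article_html(url: str) -> str:
    """Fetch a page, normalize it with tidy, then extract the
    readable article with rdrview. Sketch only; check the flags
    against your installed versions."""
    raw = subprocess.run(["curl", "-sL", url],
                         capture_output=True, check=True).stdout
    # tidy exits non-zero on mere warnings, so no check=True here
    tidied = subprocess.run(
        ["tidy", "-q", "-asxhtml", "--show-warnings", "no"],
        input=raw, capture_output=True).stdout
    article = subprocess.run(["rdrview", "-u", url, "-H"],
                             input=tidied, capture_output=True,
                             check=True).stdout
    return article.decode("utf-8", errors="replace")


print(url_to_article_html("https://example.com/some-article"))
```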
I run "lynx --dump $URL | vim -" to read the text in Vim when the web page gets too cluttered (I use Vim as a pager because I know "Vim" better than "less").
How is this different from the Wallabag project, which, as I understand it (it's on my list of "things to mess with at some point"), does exactly the same thing: website to ePub for offline reading?
Newspaper3k is a Python package I’m using to extract content from articles across the web.<p>But it has not been maintained since the author joined Facebook.<p>It works all right, but it has many issues.<p>If I understand correctly, a full-on replacement for newspaper is in the wings, seeking to offer a sustainable content-extraction tool in Python.<p>But it isn’t ready yet. And some of the problems in this area mirror those faced by web scrapers.
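For anyone who hasn't used it, the basic newspaper3k flow looks like this (the URL is a placeholder):

```python
from newspaper import Article  # pip install newspaper3k

article = Article("https://example.com/some-article")
article.download()  # fetch the HTML
article.parse()     # extract title, authors, date, body text

print(article.title)
print(article.authors)
print(article.publish_date)
print(article.text[:500])
```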
I’ve been using pandoc to extract texts and keep them next to my notes (both in Markdown) in order to add links between them. I haven’t extracted too many pages yet, but the results have been reasonable so far, although sometimes lots of HTML tags remain. Also, none of the pages has contained any math so far.
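If the leftover tags are pandoc's raw-HTML passthrough, disabling the raw_html extension on the output format should drop them instead. A small sketch driving pandoc from Python (the --to=markdown-raw_html spelling is standard pandoc extension syntax, but worth verifying on your version):

```python
import subprocess


def html_to_markdown(html: str) -> str:
    """Convert HTML to Markdown with pandoc. Appending
    "-raw_html" to the output format makes pandoc drop tags it
    can't express in Markdown rather than passing them through."""
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=markdown-raw_html"],
        input=html.encode("utf-8"), capture_output=True, check=True)
    return result.stdout.decode("utf-8")


print(html_to_markdown("<p>Hello <span class='x'>world</span></p>"))
```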
Needless to say, extractability hasn't gotten easier in recent years, but I'm even more concerned about archive.org's quality and capabilities; they really need to step up their game to remain useful in this area.
Calibre supports getpocket via a plugin that you can add from the app. Then, you can click the "Get News" button to download all the articles from your Pocket feed into your eBook reader at once.
I built a Chrome Extension that does this exact thing :).
There's also a WebAPI.<p><a href="https://epub.press/" rel="nofollow">https://epub.press/</a>
Occasionally I use <a href="https://github.com/gildas-lormeau/SingleFile" rel="nofollow">https://github.com/gildas-lormeau/SingleFile</a>
I've been making something for this for a couple of years now, with <a href="http://waldenpond.press/" rel="nofollow">http://waldenpond.press/</a><p>It connects to the Pocket API to get the parsed articles, pushes them through quite a lot of BS4 clean up, then renders them using paged.js. The resulting PDFs are then printed by Lulu.com, and they come once a month as a printed book to read completely offline.<p>I solved the Medium image issue with CSS as far as I remember. `.medium\.com svg:first-of-type` and then set it to `display: none`.