Very cool.<p>The take-any-webpage-offline need is also common in the education space (teachers want to save a webpage and send it to their students as part of a lesson, without worrying about availability, ads, etc.).<p>I used to work on tools for this <a href="https://github.com/learningequality/ricecooker/blob/develop/ricecooker/utils/downloader.py#L205-L502" rel="nofollow">https://github.com/learningequality/ricecooker/blob/develop/...</a> and <a href="https://github.com/learningequality/BasicCrawler/blob/master/basiccrawler/crawler.py#L286-L382" rel="nofollow">https://github.com/learningequality/BasicCrawler/blob/master...</a>
which worked quite well for most sites but were still very far from a general-purpose solution.<p>There is also a more powerful, general-purpose scraper that generates a ZIM file here: <a href="https://github.com/openzim/zimit" rel="nofollow">https://github.com/openzim/zimit</a><p>It would be really nice to have a "common" scraper code base that takes care of scraping (possibly with a real headless browser) and outputs all assets as files + info as JSON. This common code base could then be used by all kinds of programs to package the content as standalone HTML zip files, ePub, ZIM, or even PDF for crazy people like me who like to print things ;)
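To make the "assets as files + info as JSON" idea concrete, here's a minimal sketch assuming requests and BeautifulSoup and a static page (a real version would want a headless browser for JS-heavy sites; the names and file layout are just illustrative):

```python
import json
import pathlib
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def scrape_to_folder(url, out_dir="scraped"):
    """Download a page and its images, rewrite the <img> tags to
    point at the local copies, and write a JSON manifest of what
    was saved. Illustrative sketch only: no JS rendering, no CSS
    or script assets, no error handling."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

    assets = []
    for i, img in enumerate(soup.find_all("img", src=True)):
        src = urljoin(url, img["src"])
        name = f"asset{i}{pathlib.Path(urlparse(src).path).suffix}"
        (out / name).write_bytes(requests.get(src, timeout=30).content)
        img["src"] = name  # point the page at the local copy
        assets.append({"original_url": src, "file": name})

    (out / "index.html").write_text(str(soup), encoding="utf-8")
    (out / "manifest.json").write_text(json.dumps({
        "source_url": url,
        "title": soup.title.string if soup.title else None,
        "assets": assets,
    }, indent=2))


scrape_to_folder("https://example.com")
```

A packager (HTML zip, ePub, ZIM, PDF) could then consume the folder + manifest without caring how the scraping was done.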
I do a lot of this work[3] (web to documents) and it's interesting to see other approaches. The Medium image problem is something I've faced as well, but never got around to fixing. I'm planning to get a reMarkable soon, so will definitely be trying this out.<p>My personal solution has been <a href="https://github.com/captn3m0/url-to-epub/" rel="nofollow">https://github.com/captn3m0/url-to-epub/</a> (Node/readability), which I've tested against the entirety of Tor's original fiction collection[0], where it performs well enough (I'm biased). Another tool that does this beautifully is percollate[1], but it doesn't give the user enough control over the metadata - something I really care about.<p>I've also started to use rdrview[2], which is a C port of the current Firefox implementation of "reader view". It is very unix-y, so it is easy to pipe content to it (I usually run it through tidy first). Quite helpful in building web-archiving, web-to-pdf, or web-to-kindle pipelines.<p>[0]: <a href="https://www.tor.com/category/all-fiction/original-fiction/" rel="nofollow">https://www.tor.com/category/all-fiction/original-fiction/</a><p>[1]: <a href="https://github.com/danburzo/percollate" rel="nofollow">https://github.com/danburzo/percollate</a><p>[2]: <a href="https://github.com/eafer/rdrview" rel="nofollow">https://github.com/eafer/rdrview</a><p>[3]: <a href="https://captnemo.in/ebooks/" rel="nofollow">https://captnemo.in/ebooks/</a>
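For anyone curious what a tidy + rdrview pipeline looks like when driven from a script, here's a rough sketch in Python. It assumes curl, tidy, and rdrview are on PATH, and the rdrview flags (-u for the base URL, -H for HTML output) are the ones I remember, so verify against the man page:

```python
import subprocess


def url_to_article_html(url: str) -> str:
    """Fetch a page, normalize it with tidy, then extract the
    readable article with rdrview. Sketch only; check the flags
    against your installed versions."""
    raw = subprocess.run(["curl", "-sL", url],
                         capture_output=True, check=True).stdout
    # tidy exits non-zero on mere warnings, so no check=True here
    tidied = subprocess.run(
        ["tidy", "-q", "-asxhtml", "--show-warnings", "no"],
        input=raw, capture_output=True).stdout
    article = subprocess.run(["rdrview", "-u", url, "-H"],
                             input=tidied, capture_output=True,
                             check=True).stdout
    return article.decode("utf-8", errors="replace")


print(url_to_article_html("https://example.com/some-article"))
```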
I run "lynx --dump $URL | vim -" to read the text in Vim when the web page gets too cluttered (I use Vim as a pager because I know "Vim" better than "less").
How is this different from the Wallabag project, which, as I understand it (it's on my list of "things to mess with at some point"), does exactly the same thing: website to ePub for offline reading?
Newspaper3k is a Python package I’m using to extract content from articles across the web.<p>But it has not been maintained since the author joined Facebook.<p>It works all right, but it has many issues.<p>If I understand correctly, a full-on replacement for newspaper is in the wings, seeking to offer a sustainable content-extraction tool in Python.<p>But it isn’t ready yet. And some of the problems in this area mirror those faced by web scrapers.
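For anyone who hasn't used it, the basic newspaper3k flow looks like this (the URL is a placeholder):

```python
from newspaper import Article  # pip install newspaper3k

article = Article("https://example.com/some-article")
article.download()  # fetch the HTML
article.parse()     # extract title, authors, date, body text

print(article.title)
print(article.authors)
print(article.publish_date)
print(article.text[:500])
```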
I’ve been using pandoc to extract texts and keep them next to my notes (both in Markdown) in order to add links between them. I haven’t extracted too many pages yet, but the results have been reasonable so far, although sometimes lots of HTML tags remain. Also, none of the pages has contained any math so far.
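If the leftover tags are pandoc's raw-HTML passthrough, disabling the raw_html extension on the output format should drop them instead. A small sketch driving pandoc from Python (the --to=markdown-raw_html spelling is standard pandoc extension syntax, but worth verifying on your version):

```python
import subprocess


def html_to_markdown(html: str) -> str:
    """Convert HTML to Markdown with pandoc. Appending
    "-raw_html" to the output format makes pandoc drop tags it
    can't express in Markdown rather than passing them through."""
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=markdown-raw_html"],
        input=html.encode("utf-8"), capture_output=True, check=True)
    return result.stdout.decode("utf-8")


print(html_to_markdown("<p>Hello <span class='x'>world</span></p>"))
```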
Needless to say, extractability hasn't gotten easier in recent years, but I'm even more concerned about archive.org's quality and capabilities; they really need to step up their game to remain useful in this area.
Calibre supports getpocket via a plugin that you can add from the app. Then, you can click the "Get News" button to download all the articles from your Pocket feed into your eBook reader at once.
I built a Chrome Extension that does this exact thing :).
There's also a WebAPI.<p><a href="https://epub.press/" rel="nofollow">https://epub.press/</a>
Occasionally I use <a href="https://github.com/gildas-lormeau/SingleFile" rel="nofollow">https://github.com/gildas-lormeau/SingleFile</a>
I've been making something for this for a couple of years now, with <a href="http://waldenpond.press/" rel="nofollow">http://waldenpond.press/</a><p>It connects to the Pocket API to get the parsed articles, pushes them through quite a lot of BS4 clean up, then renders them using paged.js. The resulting PDFs are then printed by Lulu.com, and they come once a month as a printed book to read completely offline.<p>I solved the Medium image issue with CSS as far as I remember. `.medium\.com svg:first-of-type` and then set it to `display: none`.