Show HN: CLI for generating PDFs for offline reading

161 points by dvcoolarun over 1 year ago

I've always thought that extensive reading was best suited for the realm of paper. As a result, I've created a command-line interface (CLI) tailored for my own use and decided to make it open source. I welcome any feedback you may have.

[Edit] Sample PDF: <a href="https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53TK3k5E/view" rel="nofollow">https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53T...</a>

20 comments

ComputerGuru over 1 year ago

I feel like if you are claiming "beautiful" output then it's obligatory to have at the very least screenshots of said output PDFs (or better yet, a sample for the same link in the CLI screenshot, especially so people can see how the text flows, what quality images are captured at, how text can be selected, etc).

jackconsidine over 1 year ago

This is cool! I have a HN pipeline where I upvote things that I want to drill into, and a script I wrote generates PDFs and sends to my Kindle for offline reading (great for my pipeline). That uses Playwright's "to PDF" method which is over the browser and slow. I might look into replacing with this.

If there's any interest I might OSS the pipeline

nacho2sweet over 1 year ago

We just use a headless chrome with a sort of wrapper script to do this at my work with a bunch of settings close to the actual size of paper. It allows me to test all of our reports in media->print in dev tools then print->pdf with chrome and only have to design to that spec. Then in our reports we provide a "save as pdf" button instead of encouraging print in all the other possible browsers which would make the task insane and cause me to possibly quit.

dvcoolarun over 1 year ago

Apologies for the oversight; I forgot to include the screenshot of the sample PDF. Here it is for your reference: <a href="https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53TK3k5E/view" rel="nofollow">https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53T...</a>

dvcoolarun over 1 year ago

Arr, this blew up! I think, in some form, people are missing the context of the script. It's a plug-and-play script where you can make changes to PDF quality using CSS/Python. Even fonts are loaded through Google in Python. 'Beautiful' is contextual. You can create your own version and share it with the community.

I'm on mobile, so I can't add a Google Drive file screenshot to the readme, and iframes are not supported.

pavs over 1 year ago

like this:<pre><code>sudo apt install pandoc wkhtmltopdf
npm install -g readability-cli
pandoc -s https://www.paulgraham.com/avg.html -o output.html && readable output.html -o readable.html && wkhtmltopdf readable.html output.pdf && open output.pdf
</code></pre> going even further using bash script to prompt for url.<pre><code>#!/bin/bash
# Prompt the user for a URL
read -p "Enter the URL: " URL
# Use the URL in the pandoc command (quoted, and matching the case of the variable read above)
pandoc -s "$URL" -o output.html && readable output.html -o readable.html && wkhtmltopdf readable.html output.pdf && open output.pdf

# The remaining commands are run once in your shell, not inside the script:
chmod +x web2pdf.sh
# add an alias to bashrc
alias web2pdf='/path/to/your/web2pdf.sh'
source ~/.bashrc</code></pre>
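For anyone who would rather drive the same three steps from Python, here is a minimal sketch. It assumes `pandoc`, `readable` (from readability-cli) and `wkhtmltopdf` are on the PATH exactly as in the commands above; the function names `build_pipeline` and `run_pipeline` are made up for illustration.

```python
import shlex
import subprocess
from typing import List


def build_pipeline(url: str, out_pdf: str = "output.pdf") -> List[List[str]]:
    """Return the three commands from the comment above, one argv list per step."""
    return [
        ["pandoc", "-s", url, "-o", "output.html"],          # fetch page, write standalone HTML
        ["readable", "output.html", "-o", "readable.html"],  # strip clutter with readability-cli
        ["wkhtmltopdf", "readable.html", out_pdf],           # render the cleaned HTML to PDF
    ]


def run_pipeline(url: str, out_pdf: str = "output.pdf") -> None:
    """Run each step in order and stop at the first failure, like `&&` in the shell."""
    for argv in build_pipeline(url, out_pdf):
        print("+", shlex.join(argv))      # echo the command before running it
        subprocess.run(argv, check=True)  # raises CalledProcessError on a non-zero exit
```

Calling `run_pipeline("https://www.paulgraham.com/avg.html")` should behave like the one-liner, minus the final `open`. Passing argv lists instead of a shell string means the URL never needs quoting.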

seabass-labrax over 1 year ago

Very interesting! One piece of feedback: it would probably be more useful to have a screenshot of the PDF on your README rather than one of the CLI. Also, do you intend to release this as FOSS?

adrian_b over 1 year ago

Both Chrome and Firefox have absolutely horrible "Print" (to PDF) commands, which render the Web pages in a different way than what they show on the screen, and which results in large parts of the page being obscured by ads, menus, headers, etc., or in parts of the Web page that are outside the rendered area, so they are missing, or in content that is compressed to a small part of the output pages.

It would be really nice if there existed a utility able to produce a PDF file where the Web pages are rendered as well as the browsers render them on the screen, without becoming confused even by complex scripts loaded by the page.

The alternatives to "Print" (producing a PDF) are even worse. A screenshot has limited resolution and it loses the text. In the past "Save as ..." was the normal solution, but now even if you save a "complete" page, it will still frequently include scripts that will no longer work offline. What I want to save are the pages perfectly rendered as they were at that instant, without any scripts that could make them appear differently in the future.

Someone over 1 year ago

FTA: “Then you can use the tool as follows<pre><code>pipenv shell
pipenv install
python main.py https://www.paulgraham.com/avg.html, https://www.paulgraham.com/determination.html
</code></pre> Just add the webpage URLs separated by commas”

What’s the rationale for “separated by commas”? The convention for CLI arguments is to use one argument per input file.
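To illustrate the one-argument-per-input convention, here is a rough sketch using `argparse`. It is not the tool's actual argument handling; `parse_urls` is a hypothetical name, and tolerating a trailing comma is only a nod to the README's current style.

```python
import argparse
from typing import List, Optional


def parse_urls(argv: Optional[List[str]] = None) -> List[str]:
    """Parse one URL per positional argument (hypothetical CLI sketch)."""
    parser = argparse.ArgumentParser(description="Convert web pages to PDF.")
    parser.add_argument("urls", nargs="+", metavar="URL", help="one or more page URLs")
    args = parser.parse_args(argv)
    # Tolerate `url1, url2` as typed in the README: the shell already splits on
    # the space, so only a trailing comma needs stripping from each argument.
    cleaned = [arg.rstrip(",") for arg in args.urls]
    return [url for url in cleaned if url]
```

With this shape, `python main.py URL1 URL2` works as most CLI users expect, and shell features such as globbing or `xargs` compose naturally.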

jll29 over 1 year ago

<pre><code>% python main.py https://www.paulgraham.com/avg.html
Traceback (most recent call last):
  File "/Users/bill/web2pdf/main.py", line 7, in <module>
    from readability import Document
ImportError: cannot import name 'Document' from 'readability' (/Users/bill/.local/share/virtualenvs/web2pdf-gXeVRXKg/lib/python3.9/site-packages/readability/__init__.py)
</code></pre> But according to your Pipfile.lock, the readability module needed is 0.3.1:<pre><code>"readability": {
    "hashes": [
        "sha256:f9030df8bc31aad45baffa9a2d9ce1fdd8051833e5b5bda3027df32fdec00fad"
    ],
    "index": "pypi",
    "version": "==0.3.1"
},
</code></pre> Version 0.3.1 of the module "readability" exists, but does not appear to have a class "Document".
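A quick way to confirm this kind of mismatch before running the tool is to check whether the installed module exposes the name at all. The helper below is a generic sketch (`module_exports` is a made-up name). One possible explanation, which is an assumption and not something the thread confirms, is that `Document` comes from a different PyPI distribution, `readability-lxml`, which also installs a module named `readability`.

```python
import importlib


def module_exports(module_name: str, attr: str) -> bool:
    """Return True if `module_name` can be imported and exposes `attr` at top level."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # module is not installed, or fails on import
    return hasattr(module, attr)
```

In the environment from the traceback, `module_exports("readability", "Document")` would presumably return False, matching the ImportError.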

OhMeadhbh over 1 year ago

Apropos of nothing, I added this function so I don't have to leave the command line to see the PDF.<pre><code>pdfpage() {
  convert -resize 0x1000^ "${1}"[${2}] -background white -flatten sixel:-
}
</code></pre> You can probably deduce it assumes you have ImageMagick installed and you're in a terminal with sixel support.

fishywang over 1 year ago

Somewhat similarly, I wrote a web app to generate epub (instead of pdf) out of urls and send to eink reader(s) directly (via a telegram bot) so I can read them. Currently it supports sending epub by email (for kindle) or uploading epub to dropbox (for kobo, etc.). It originally also supported reMarkable cloud but we can no longer make reMarkable cloud actually work. There's also a REST api to generate epub to be downloaded directly: <a href="https://github.com/fishy/url2epub/blob/main/REST.md">https://github.com/fishy/url2epub/blob/main/REST.md</a>

For e-ink readers epubs are generally better than PDFs for urls anyways, as epubs are basically packed htmls, and also the flow text works better on smaller screens.

Throw73747 over 1 year ago

Perhaps add ublock filters support? I use it to strip down any unwanted elements on page before printing. On hacker news discussions it removes forms, reply links, header and footers...

rahimnathwani over 1 year ago

For print or PDF, I like multi-column newspaper style, as created by this extension: <a href="https://chromewebstore.google.com/detail/simple-print/nalmbmopkipfhijmcncelapgbkgoligf" rel="nofollow">https://chromewebstore.google.com/detail/simple-print/nalmbm...</a>

One benefit of using a Chrome extension (vs. CLI) is that it's easy to 'print' things that require authentication.

jll29 over 1 year ago

Have you compared it with a conversion by pandoc (<a href="https://pandoc.org/" rel="nofollow">https://pandoc.org/</a>)?

sn0n over 1 year ago

Does it run a headless chrome for pixel perfect formatting as laid out as a webpage and applied in that format to PDF, ignoring the page's print CSS rules? Cuz, that would be a nice start. And an option for size to be pixel width based for ideal layout... Because I won't be printing, I will be viewing on my phone, so one overly large page would be perfect.
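For reference, something along these lines can be sketched with Playwright's Python API. This is not what the posted tool does; it assumes the third-party `playwright` package and its Chromium build are installed, and the function names and the 412px default width are arbitrary choices for illustration.

```python
from typing import Dict, Union


def single_page_pdf_options(width_px: int, height_px: int) -> Dict[str, Union[str, bool]]:
    """Build page.pdf() options for one page sized in CSS pixels."""
    return {
        "width": f"{width_px}px",
        "height": f"{height_px}px",
        "print_background": True,  # keep background colours and images
    }


def save_single_page_pdf(url: str, out_path: str, width_px: int = 412) -> None:
    """Render `url` with screen (not print) CSS into one tall PDF page."""
    # Imported lazily so the options helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page(viewport={"width": width_px, "height": 800})
        page.goto(url, wait_until="networkidle")
        page.emulate_media(media="screen")  # ignore the page's @media print rules
        height_px = page.evaluate("document.documentElement.scrollHeight")
        page.pdf(path=out_path, **single_page_pdf_options(width_px, height_px))
        browser.close()
```

Sizing the page to the document's scroll height is what aims for "one overly large page"; very long pages may still be split by the renderer, so treat the result as best effort.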

harry8 over 1 year ago

Webbrowser opens url -> print -> save as/to pdf?

I'm sure I'm missing something, what is a cli interface buying me here?

K2h over 1 year ago

Very cool! In README.md, is that an extra 'p' in Webp2pdf?

codeonline over 1 year ago

Can you add comparison pdfs generated by pandoc and gotenberg?

skanga over 1 year ago

Found some potential bugs. Please check the github issues page.
