Show HN: Paper to HTML Converter

153 points by codeviking over 3 years ago

14 comments

codeviking over 3 years ago
Hi all,

I’m one of the engineers at AI2 who helped make this happen. We’re excited about this for several reasons, which I’ll explain below.

Most academic papers are currently inaccessible. This means, for instance, that researchers who are vision impaired can’t access that research. Not only is this unfair, but it probably prevents breakthroughs from happening by limiting opportunities for collaboration.

We think this is partly because the PDF format isn’t easy to work with, and therefore isn’t easy to make accessible. HTML, on the other hand, has benefited from years of open contributions. There are a lot of accessibility affordances, and they’re well documented and easy to add. In fact, our long-term hope is to use ML to make papers more accessible without (much) effort on the author’s part.

We’re also excited about distributing papers in their HTML form, as we think it’ll allow us to greatly improve the UX of reading papers. We think papers should be easy to read regardless of the device you’re on, and we want to provide interactive, ML-provided enhancements to the reading experience like those provided via the Semantic Reader.

We’re eager to hear what you think, and happy to answer questions.
nanis over 3 years ago
This seems to be pdf2tohtml combined with GROBID[1].

It seems to me the masheen learningz technikz boil down to a generalization of my lightbulb moment here[2].

[1]: https://grobid.readthedocs.io/en/latest/

[2]: https://www.nu42.com/2014/09/scraping-pdf-documents-without-losing.html
oolonthegreat over 3 years ago
cool project, though the name was confusing for me: I believe to most people "paper" first means actual paper, so I thought this was some kind of OCR system converting printed material to html?
gregsadetsky over 3 years ago
Great site, congrats!

One comment is that the slowest page to load was the Gallery [0], as it loads an ungodly amount of PNG files from what appears to be a single IP (a GCP Compute instance?)

I see 421 requests and 150 MB loaded. As it seems to be mostly thumbnails, have you considered using JPEGs instead of PNGs, lazy loading (i.e. not loading images outside of the viewport), and a CDN offering from GCP (or another provider)?

Once I clicked a thumbnail, loading the article itself (for example [1]) was quite breezy.

The gallery is a great showcase of what your site does -- I think that it'd be worth making it snappier :-)

Cheers and congrats again

P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"

[0] https://papertohtml.org/gallery

[1] https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba7aaae9f0a2e1c
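For anyone wondering what the lazy-loading suggestion above could look like in practice, here is a minimal sketch. It assumes the gallery is rendered server-side as plain HTML; the function name, the tuple layout, the CSS class and the fixed thumbnail dimensions are all hypothetical and not taken from papertohtml.org. The only real feature it relies on is the standard `loading="lazy"` attribute on `<img>`, which asks the browser to defer images that are outside the viewport.

```python
from html import escape


def render_gallery(thumbnails):
    """Return HTML for a thumbnail grid that defers off-screen image loads.

    `thumbnails` is a list of (image_url, paper_url, title) tuples.
    This layout is an assumption made for illustration only.
    """
    items = []
    for image_url, paper_url, title in thumbnails:
        items.append(
            '<a href="{href}"><img src="{src}" alt="{alt}" '
            'loading="lazy" width="200" height="260"></a>'.format(
                # Escape every value so quotes in titles cannot break the markup.
                href=escape(paper_url, quote=True),
                src=escape(image_url, quote=True),
                alt=escape(title, quote=True),
            )
        )
    return '<div class="gallery">\n' + "\n".join(items) + "\n</div>"


if __name__ == "__main__":
    print(render_gallery([("/thumbs/a.jpg", "/paper?id=a", "Example paper")]))
```

Setting explicit `width` and `height` lets the browser reserve space before each image arrives, so the page does not jump around as thumbnails load. Switching the files themselves from PNG to JPEG and putting them behind a CDN would be separate changes on the hosting side.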
chrisMyzel over 3 years ago
This is amazing! Will make my (offline-only) Kindle finally display scientific papers. I took a random link from arXiv and it worked like a charm, including the TOC. Will this be open-sourced?
p4bl0 over 3 years ago
I tried that a few days ago with one of my papers (a PDF generated using pdflatex) and it didn't work that well: the text was fine but some section titles were off, and all of the math and code parts were broken.

But clearly it is a nice idea and I can't wait for such tools to work better!
Klasiaster over 3 years ago
For non-reflow conversion there is pdf2htmlEX. https://github.com/coolwanglu/pdf2htmlEX is discontinued, but development continues under https://github.com/pdf2htmlEX/pdf2htmlEX

Demo: https://pdf2htmlex.github.io/pdf2htmlEX/doc/tb108wang.html
kartoshechka over 3 years ago
Looks like exactly the type of crunch work ML would do, but have you considered using brute-force converters like LaTeXML or pandoc where appropriate?
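To make the pandoc route mentioned above concrete, here is a rough sketch of how such a conversion might be driven from Python. It assumes the LaTeX source of the paper is available (which is not the case for a PDF-only paper) and that the `pandoc` binary is installed; the function names and file paths are hypothetical. The flags used (`--from`, `--to`, `--standalone`, `--mathjax`, `--output`) are standard pandoc options. This is not how Paper to HTML works, only an illustration of the alternative.

```python
import shutil
import subprocess


def pandoc_command(tex_path, html_path):
    """Build the pandoc argument list for a LaTeX to HTML conversion."""
    return [
        "pandoc",
        tex_path,
        "--from=latex",
        "--to=html5",
        "--standalone",  # emit a full HTML document, not a fragment
        "--mathjax",     # render math in the browser via MathJax
        "--output=" + html_path,
    ]


def convert(tex_path, html_path):
    """Run pandoc, failing loudly if it is missing or returns an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(tex_path, html_path), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(pandoc_command("paper.tex", "paper.html")))
```

Because math stays as LaTeX and is rendered by MathJax, equations would not need to be extracted from a PDF at all, which is the main appeal of converting from source where it exists.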
NmAmDa over 3 years ago
I tried several physics papers and none of them had any equations extracted. Does it have problems with LaTeX equations by design?
jimmySixDOF over 3 years ago
I am so amazed at the work you guys are doing at AI2 & the Semantic Scholar project. You guys are really fixing a broken system of research and discovery, which suffers from organization design principles based on university library index card filing cabinets, as magnified by the exponential content growth.

Can't wait to see what people do with this...
weystrom over 3 years ago
When are we, as people, going to ditch PDF? It's an awful format.

My friend wrote his PhD in LaTeX, but it all ends up being PDFed anyway for what, eye candy?

It's time to move on. #ditchpdf
tailspin2019 over 3 years ago
Haven't tried it yet, but a very cool concept.

As per other recent discussions on HN, I think the general accessibility of academic papers is ripe for improvement.
Orionos over 3 years ago
Please make it popular in the research field so you can spin up your own Sci-Hub!
johnhenry over 3 years ago
Retro mode should be default.