Hi all,<p>I’m one of the engineers at AI2 who helped make this happen. We’re excited about this for several reasons, which I’ll explain below.<p>Most academic papers are currently inaccessible. This means, for instance, that researchers who are vision impaired can’t access that research. Not only is this unfair, but it probably prevents breakthroughs from happening by limiting opportunities for collaboration.<p>We think this is partly because the PDF format isn’t easy to work with, and is therefore hard to make accessible. HTML, on the other hand, has benefited from years of open contributions. There are many accessibility affordances, and they’re well documented and easy to add. In fact, our long-term hope is to use ML to make papers more accessible without (much) effort on the author’s part.<p>We’re also excited about distributing papers in HTML form, as we think it’ll allow us to greatly improve the UX of reading papers. We think papers should be easy to read regardless of the device you’re on, and we want to provide interactive, ML-provided enhancements to the reading experience, like those in the Semantic Reader.<p>We’re eager to hear what you think, and happy to answer questions.
This seems to be pdf2tohtml combined with GROBID[1].<p>It seems to me the masheen learningz technikz boil down to a generalization of my lightbulb moment here[2].<p>[1]: <a href="https://grobid.readthedocs.io/en/latest/" rel="nofollow">https://grobid.readthedocs.io/en/latest/</a><p>[2]: <a href="https://www.nu42.com/2014/09/scraping-pdf-documents-without-losing.html" rel="nofollow">https://www.nu42.com/2014/09/scraping-pdf-documents-without-...</a>
Cool project, though the name confused me: I believe that to most people "paper" first means actual paper, so I thought this was some kind of OCR system converting printed material to HTML.
Great site, congrats!<p>One comment is that the slowest page to load was the Gallery [0], as it loads an ungodly number of PNG files from what appears to be a single IP (a GCP Compute instance?)<p>I see 421 requests and 150 Mb loaded. As it seems to be mostly thumbnails, have you considered using JPEGs instead of PNGs, lazy loading (i.e. not loading images outside of the viewport), and a CDN offering from GCP (or another provider)?<p>Once I clicked a thumbnail, loading the article itself (for example [1]) was quite breezy.<p>The gallery is a great showcase of what your site does -- I think that it'd be worth making it snappier :-)<p>Cheers and congrats again<p>P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"<p>[0] <a href="https://papertohtml.org/gallery" rel="nofollow">https://papertohtml.org/gallery</a><p>[1] <a href="https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba7aaae9f0a2e1c" rel="nofollow">https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba...</a>
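On the lazy-loading suggestion: a minimal sketch, assuming the gallery is served as HTML with plain <img> tags (an assumption, since the site's markup isn't shown here). The helper name add_lazy_loading is made up for illustration; loading="lazy" is the standard HTML attribute that lets browsers defer images outside the viewport.

```python
import re

# Matches a whole <img ...> tag (case-insensitive).
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Matches an existing loading= attribute, but not e.g. data-loading=.
HAS_LOADING = re.compile(r"(?<![-\w])loading\s*=", re.IGNORECASE)


def add_lazy_loading(html):
    """Add loading="lazy" to every <img> tag that does not already set it."""

    def patch(match):
        tag = match.group(0)
        if HAS_LOADING.search(tag):
            return tag  # respect an explicit loading attribute
        # Insert the attribute right after the 4 characters of "<img".
        return tag[:4] + ' loading="lazy"' + tag[4:]

    return IMG_TAG.sub(patch, html)


if __name__ == "__main__":
    print(add_lazy_loading('<a href="/paper"><img src="thumb.png"></a>'))
```

A regex is enough for a quick pass over generated thumbnails; for arbitrary HTML a real parser would be the safer choice.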
This is amazing! It will finally make my (offline-only) Kindle display scientific papers. I took a random link from arXiv and it worked like a charm, including the TOC. Will this be open-sourced?
I tried that a few days ago with one of my papers (a PDF generated using pdflatex) and it didn't work that well: the text was fine, but some section titles were off, and all of the math and code parts were broken.<p>But it is clearly a nice idea, and I can't wait for such tools to work better!
For non-reflow conversion there is pdf2htmlEX. The original repository, <a href="https://github.com/coolwanglu/pdf2htmlEX" rel="nofollow">https://github.com/coolwanglu/pdf2htmlEX</a>, is discontinued, but there is development under <a href="https://github.com/pdf2htmlEX/pdf2htmlEX" rel="nofollow">https://github.com/pdf2htmlEX/pdf2htmlEX</a><p>Demo: <a href="https://pdf2htmlex.github.io/pdf2htmlEX/doc/tb108wang.html" rel="nofollow">https://pdf2htmlex.github.io/pdf2htmlEX/doc/tb108wang.html</a>
This looks like exactly the type of crunch work ML would do, but have you considered using brute-force converters like LaTeXML or pandoc where appropriate?
I am so amazed at the work you guys are doing at AI2 and the Semantic Scholar project. You guys are really fixing a broken system of research and discovery, one whose organizational design principles are based on university library index card filing cabinets and whose problems are magnified by the exponential growth in content.<p>Can't wait to see what people do with this...
When are we, as people, going to ditch PDF? It's an awful format.<p>My friend wrote his PhD in LaTeX, but it all ends up being PDFed anyway for what, eye candy?<p>It's time to move on. #ditchpdf
Haven't tried it yet, but it's a very cool concept.<p>As per other recent discussions on HN, I think the general accessibility of academic papers is ripe for improvement.