14 comments

codeviking, over 3 years ago

Hi all,

I’m one of the engineers at AI2 that helped make this happen. We’re excited about this for several reasons, which I’ll explain below.

Most academic papers are currently inaccessible. This means, for instance, that researchers who are vision impaired can’t access that research. Not only is this unfair, but it probably prevents breakthroughs from happening by limiting opportunities for collaboration.

We think this is partly due to the fact that the PDF format isn’t easy to work with, and is therefore hard to make accessible. HTML, on the other hand, has benefited from years of open contributions. There are a lot of accessibility affordances, and they’re well documented and easy to add. In fact, our long-term hope is to use ML to make papers more accessible without (much) effort on the author’s part.

We’re also excited about distributing papers in their HTML form as we think it’ll allow us to greatly improve the UX of reading papers. We think papers should be easy to read regardless of the device you’re on, and want to provide interactive, ML-provided enhancements to the reading experience like those provided via the Semantic Reader.

We’re eager to hear what you think, and happy to answer questions.

nanis, over 3 years ago

This seems to be pdf2tohtml combined with GROBID [1].

It seems to me the masheen learningz technikz boil down to a generalization of my lightbulb moment here [2].

[1]: https://grobid.readthedocs.io/en/latest/
[2]: https://www.nu42.com/2014/09/scraping-pdf-documents-without-losing.html

oolonthegreat, over 3 years ago

cool project, though the name was confusing for me: I believe to most people "paper" first means actual paper, so I thought this was some kind of OCR system converting printed material to html?

gregsadetsky, over 3 years ago

Great site, congrats!

One comment is that the slowest page to load was the Gallery [0], as it loads an ungodly amount of PNG files from what appears to be a single IP (a GCP Compute instance?)

I see 421 requests and 150 Mb loaded. As it seems to be mostly thumbnails, have you considered using JPEGs instead of PNGs, potentially using lazy loading (i.e. not loading images outside of the viewport), and potentially using GCP's (or another provider's) CDN offering?

Once I clicked a thumbnail, loading the article itself (for example [1]) was quite breezy.

The gallery is a great showcase of what your site does -- I think that it'd be worth making it snappier :-)

Cheers and congrats again

P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"

[0] https://papertohtml.org/gallery
[1] https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba7aaae9f0a2e1c

chrisMyzel, over 3 years ago

This is amazing! Will make my (offline-only) Kindle finally display scientific papers. Took a random link from arXiv and it worked like a charm, including the TOC. Will this be open-sourced?

p4bl0, over 3 years ago

I tried that a few days ago with one of my papers (a PDF generated using pdflatex) and it didn't work that well: the text was fine but some section titles were off, and all of the math and code parts were broken.

But clearly it is a nice idea and I can't wait for such tools to work better!

Klasiaster, over 3 years ago

For non-reflow conversion there is pdf2htmlEX: https://github.com/coolwanglu/pdf2htmlEX is discontinued but there is development under https://github.com/pdf2htmlEX/pdf2htmlEX

Demo: https://pdf2htmlex.github.io/pdf2htmlEX/doc/tb108wang.html

kartoshechka, over 3 years ago

Looks like exactly the type of crunch work ML would do, but have you considered using brute force converters like latexml or pandoc where appropriate?

NmAmDa, over 3 years ago

I tried several physics papers and none of them had any equations extracted. Does it have problems with LaTeX equations by design?

jimmySixDOF, over 3 years ago

I am so amazed at the work you guys are doing at AI2 & the Semantic Scholar project. You guys are really fixing a broken system of research and discovery, which suffers from organizational design principles based on university library index card filing cabinets, as magnified by exponential content growth.

Can't wait to see what people do with this...

weystrom, over 3 years ago

When are we, as people, going to ditch PDF? It's an awful format.

My friend wrote his PhD in LaTeX, but it all ends up being PDFed anyway for what, eye candy?

It's time to move on. #ditchpdf

tailspin2019, over 3 years ago

Haven't tried it yet, but a very cool concept.

As per other recent discussions on HN, I think the general accessibility of academic papers is ripe for improvement.

Orionos, over 3 years ago

Please make it popular in the research field so you can spin up your own Sci-Hub!

johnhenry, over 3 years ago

Retro mode should be default.

Show HN: Paper to HTML Converter
