ArXiv now offers papers in HTML format

1204 points by programd, over 1 year ago

70 comments

shrimpx, over 1 year ago

Since the article doesn't link to any example HTML article, here's a random link: https://browse.arxiv.org/html/2312.12451v1

It's cool that it has a dark mode. Didn't see a toggle, but it renders in the system mode.

Overall, this will make arXiv a lot more accessible on mobile.

jez, over 1 year ago

It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

- I can imagine authors feeling frustrated if someone reaches out about a problem in the HTML version of their paper, but they have no way to correct it except by hoping that a change to the PDF fixes a change to the generated HTML. Easier to just fix the formatting problem in the PDF outright.
- It would be neat to allow people to experiment with alternative formatting for their papers. For example, imagine a paper about a programming language that embeds a sandbox you can use to play around with the language under discussion. Or a paper about multivariable calculus where you can interact with a three-dimensional plot of some function.

svag, over 1 year ago

The tool being used for this offering is this one: https://github.com/arXiv/arxiv-readability, just to save a few clicks :)

injuly, over 1 year ago

For anyone who needs it, arxiv-vanity is amazing: https://www.arxiv-vanity.com/

jll29, over 1 year ago

It's a cool feature because it makes the papers more findable, more easily navigable, easier to read online and faster to scroll through. I am also happy for blind people that they can more easily use arXiv with Braille readers now.

(I'm still a fan of printing the PDFs, because I annotate on paper and refer to page numbers, but the HTML feature is in addition to PDF download, not a replacement.)

One thing that still sucks (not arXiv related though) is reading mathematical formulae on the Kindle. I wonder if someone with rendering expertise could have a look into the MOBI format.

astrolx, over 1 year ago

This is excellent news. Their HTML formatting is also more pleasant than the HTML articles offered by most journals in my field (e.g., arXiv HTML footnotes are displayed as sidenotes on large displays!)

tarboreus, over 1 year ago

One of the reasons is to make the papers more accessible to people with disabilities, especially the blind. I participated in a conference they hosted on this a few months ago; I recommend taking a look at the recordings if you're interested in the thinking on this.

https://accessibility2023.arxiv.org/

reqo, over 1 year ago

A lot of AI/ML papers these days have an accompanying interactive page like [0]. Will we see anything like these directly in arXiv now?

[0] https://voyager.minedojo.org/

shusaku, over 1 year ago

Seems like the references aren’t working very well.

I really want journals to have two-way links in a paper. I get Google Scholar alerts about certain papers being cited, and I want to skip to “why did they cite this? Did they use it, improve it, or just mention it?”

leoncaet, over 1 year ago

I just hope they don't stop offering the papers in PDF. Even when I'm on a computer, I still prefer to read PDFs.

ansk, over 1 year ago

When I open a large pdf on arxiv (100+ MB, not uncommon for ML papers focused on hi-res image generation), there is a significant load time (10+ seconds) before anything is rendered at all other than a loading bar. Does anyone know what the source of this delay is? Is it network-bound or is Chrome just really slow to render large PDFs? Do PDFs have to be fully downloaded to begin rendering? In any case, this delay is my only gripe with arxiv and a progressively rendered HTML doc that instantly loads the document text would be a huge improvement.

wolverine876, over 1 year ago

Many here say they prefer HTML documents. How do you annotate them? How do you make local copies? Also, how will you read them in the decades to come?

I love PDF.

aragonite, over 1 year ago

A lot of academic journals (say from Springer) also offer HTML formats for papers published in the past decade or so, which I personally often find more convenient for reading purposes than PDFs. For example, I parse text a lot faster if I use a regex to split each paragraph into sentences and place a linebreak after each sentence, or if I do natural language "syntax highlighting" by assigning a distinctive color to functional words indicating logical structure like 'if/then', 'and', 'or', 'not', 'because', and 'is'. And sometimes it really improves readability to be able to do "semantic highlighting", in the sense of say assigning a different hashed color to each proper name (or each labeled thesis, etc) that occurs in the paper. Such manipulations are basically impossible with PDFs. It makes me wish sci-hub would start archiving HTML versions in addition to PDFs!
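
A minimal sketch of the kind of manipulation described above, assuming the paragraph text has already been extracted from the HTML. The function names, the word list, and the "capitalized word that does not start a sentence" test for proper names are all illustrative assumptions, not anything from the comment, and the sentence-splitting regex is a rough heuristic that will mis-split abbreviations.

```python
import hashlib
import html
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace and a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Functional words that signal logical structure (list taken from the comment).
LOGIC_WORDS = {"if", "then", "and", "or", "not", "because", "is"}


def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph into sentences with the heuristic regex above."""
    return [s for s in SENTENCE_BOUNDARY.split(paragraph.strip()) if s]


def hashed_color(token: str) -> str:
    """Map a token to a stable hex color taken from its MD5 digest."""
    return "#" + hashlib.md5(token.encode("utf-8")).hexdigest()[:6]


def highlight(sentence: str) -> str:
    """Bold logic words; color capitalized words that do not start the sentence."""
    out = []
    for i, word in enumerate(sentence.split()):
        core = word.strip(".,;:!?()\"'")
        escaped = html.escape(word)
        if core.lower() in LOGIC_WORDS:
            out.append(f"<b>{escaped}</b>")
        elif i > 0 and core[:1].isupper():
            # Crude stand-in for proper-name detection (hypothetical rule).
            out.append(f'<span style="color:{hashed_color(core)}">{escaped}</span>')
        else:
            out.append(escaped)
    return " ".join(out)


def reformat_paragraph(paragraph: str) -> str:
    """One highlighted sentence per line, joined with <br> tags."""
    return "<br>\n".join(highlight(s) for s in split_sentences(paragraph))
```

Calling `reformat_paragraph(text)` on each paragraph yields an HTML fragment with one sentence per line. Because the color comes from a hash, the same name gets the same color everywhere in the paper.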

golol, over 1 year ago

IMO PDF and HTML optimize for different things. PDF is easy and pretty. HTML is easy and responsive. But making PDF responsive is impossible and making HTML pretty is not easy. I think arXiv should be for well-polished, pretty documents, not responsive, ugly documents. Most researchers don't have time to make an HTML document both responsive and pretty.

FredPret, over 1 year ago

This is brilliant. I don't share academia's love of LaTeX multi-column PDFs.

delhanty, over 1 year ago

> If you are familiar with ar5iv, an arXivLabs collaboration, our HTML offering is essentially bringing this impactful project fully “in-house”. Our ultimate goal is to backfill arXiv’s entire corpus so that every paper will have an HTML version, but for now this feature is reserved for new papers.

IIRC, ar5iv was created on his own initiative by Deyan Ginev

https://twitter.com/dginev/status/1736792316675825981

and it seems that he has worked tirelessly to fix nearly all of the edge cases during the collaboration.

This project creates huge value for humanity, so Deyan is to be heartily thanked.

pushfoo, over 1 year ago

Previously discussed: https://news.ycombinator.com/item?id=38713215

WendyTheWillow, over 1 year ago

So far I’m left wanting an app that gives me a way to easily track and consume newly published work on a given topic. The existing apps are not great, and maybe this change will make it easier to provide better “reader” views, and possibly even TTS (I like to listen+read).

binarymax, over 1 year ago

Nice! Now I don’t need to manually replace arxiv with ar5iv. Congrats to the team.
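
For anyone who still wants to script that manual swap (for example, for older papers), here is a rough sketch. It assumes the ar5iv URL shape seen in a link elsewhere in this thread (`https://ar5iv.labs.arxiv.org/html/<id>`) and the usual `arxiv.org/abs/<id>` and `arxiv.org/pdf/<id>` paths. The function name `to_ar5iv` is made up for illustration, and version suffixes such as `v1` are passed through untouched.

```python
from urllib.parse import urlparse

# Assumption: ar5iv pages live under this prefix, as in the link shared in this thread.
AR5IV_PREFIX = "https://ar5iv.labs.arxiv.org/html/"


def to_ar5iv(url: str) -> str:
    """Rewrite an arxiv.org abs/pdf URL into the corresponding ar5iv URL."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in {"arxiv.org", "www.arxiv.org"}:
        raise ValueError(f"not an arxiv.org URL: {url}")

    # Expect a path of the form /abs/<id> or /pdf/<id>[.pdf].
    parts = parsed.path.strip("/").split("/", 1)
    if len(parts) != 2 or parts[0] not in {"abs", "pdf"}:
        raise ValueError(f"unrecognized arXiv path: {parsed.path}")

    paper_id = parts[1]
    if paper_id.endswith(".pdf"):
        paper_id = paper_id[: -len(".pdf")]
    return AR5IV_PREFIX + paper_id
```

Usage: `to_ar5iv("https://arxiv.org/abs/1609.04747")` gives the ar5iv address for the same paper ID.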

therealmarv, over 1 year ago

This is the reason I've never liked LaTeX from a data point of view. It's made to be printed out or to look beautiful in a PDF, but was never designed to get you to an HTML file or a Word file.

I've written my thesis in Markdown in the past because of this (best for humans), which can be easily transformed to HTML, Word, PDF and even LaTeX: https://github.com/tompollard/phd_thesis_markdown

And I think that XML is the best format for machines.

codethief, over 1 year ago

Ugh. I don't belong to the target audience (people with disabilities) but the typesetting doesn't exactly look pleasant on my machine (Chrome on Linux).

Al-Khwarizmi, over 1 year ago

Nice! It would be even better if they offered authors of previous papers the option of converting to HTML, as the LaTeX sources are already in the system.

odyssey7, over 1 year ago

<pre><code> article { text-justify: Knuth-Plass; }</code></pre>

cozzyd, over 1 year ago

Doesn't work great with long author lists... https://browse.arxiv.org/html/2312.12907v1

philipashlock, over 1 year ago

30 years after HTML was invented to support accessibility and collaboration for research and academia, and the same day the White House released their new accessibility guidance, which happens to be the first time they've published formal new policy natively as HTML rather than PDF: https://www.whitehouse.gov/omb/management/ofcio/m-24-08-strengthening-digital-accessibility-and-the-management-of-section-508-of-the-rehabilitation-act/

happyyalda, over 1 year ago

Unfortunately, I am from Iran so I can't use this new feature. I get a '403 Forbidden' message from the arXiv server. Worse than that, I totally lost my access to arXiv since they changed their CDN to Fastly, because fucking mullahs don't like Fastly!

hk-senokr, over 1 year ago

Give it to the United States 2 minutes you're open and your hack smoker Hancock minutes and even this your combustion area of monument time cuz you said looking on the baseball miserable I didn't want to buy it I desktop and your current events are my not me his not he I took it in the garage your prime minister 70 or my event your lucky alone haircut at Josephine alone hacker smoker king Kong young under hackers no car orange county Joseph Adidas adorius avenue I got a new Nissan I thought you need something f*** at Robert Omaha Fernandez Serbia Yunnan i England England Britannia English

trostaft, over 1 year ago

Taking a look at a paper I have that went up this month and another that went up before the December cutoff on ar5iv, they look 90% OK! Figures with side-by-side plots and algorithm environments are the common culprits for being broken, though. Particularly in figures, it seems like the width argument isn't being interpreted correctly.

Interestingly, this review paper seems to have its side-by-side figures intact (e.g. fig 2, fig 4). Maybe it's because he used a subfigure-like environment (judging by the subcaptions)?

https://ar5iv.labs.arxiv.org/html/1609.04747

krick, over 1 year ago

Curious to see how well it will work. Does anybody here know a robust and not crazy computationally expensive solution to extract tables from fairly clean PDF files (especially non-English)?

killjoywashere, over 1 year ago

So, I'm seeing a lot of chatter in the thread about LaTeX and converting that to HTML and PDF, so LaTeX should be the superior single source of truth. Please keep in mind that many areas of science think of latex as an allergy. I even have a colleague, a plasma physicist, who strongly encourages his team to not use LaTeX because a) collaborators get confused and b) it can be a massive time suck.

ChrisArchitect, over 1 year ago

[dupe] from yesterday

More here: https://news.ycombinator.com/item?id=38713215

blackoil, over 1 year ago

> Didn't see a toggle

You can run toggleColorScheme() twice in the console to switch to the light or dark theme.

amai, over 1 year ago

This will be one of the most popular applications written in Perl, because it is based on the 20-year-old LaTeXML: https://en.wikipedia.org/wiki/LaTeXML

IHLayman, over 1 year ago

Fun fact: it seems that if you use Lockdown mode on Apple devices you can't open PDFs from a browser (no official documentation says it, but there is anecdotal evidence). This would allow people with Lockdown mode to open arXiv papers more easily.

sylware, over 1 year ago

Like the maths noscript/basic (x)html wikipedia generator:

The magic of inline images at a known DPI; of course you can provide images for different DPIs.

Reading maths/science noscript/basic (x)html documents on my 100 DPI monitor, on wikipedia. Not yet fully ready on arxiv.

forgingahead, over 1 year ago

What I would like is for ArXiv to have an LLM to rewrite all papers away from the stodgy, stilted language prevalent in every paper. Just write clearly gang, use proper paragraph breaks and stop with the run-on sentences.

sicariusnoctis, over 1 year ago

Personally, I would prefer the conventional Latin Modern math font instead of Palatino math.

Latin Modern is used by:

- Wikipedia.
- Math.StackExchange.
- Nearly all papers, including the ones hosted on arxiv in PDF format.
- Nearly any math videos, slides/presentations, notes.
- Almost everything, really.

Palatino just looks weird.

Also, I imagine that authors might do math formatting hacks that were only tested on Latin Modern, and might end up breaking on Palatino.

TL;DR:

Palatino :(

Latin Modern :)

choppaface, over 1 year ago

Hope they benefit from CDN caching now too.

Edit: aaaand they got Fastly https://news.ycombinator.com/item?id=38723373

topicseed, over 1 year ago

What do they use to convert a PDF document to a clean, correct HTML document? It's a difficult space, especially with the variety of layouts you may find in PDF documents...

jcq3, over 1 year ago

It will ease data scraping, automated meta analysis...

zerop, over 1 year ago

They should also add commenting capabilities under the paper. A good discussion will lead to more research and information discovery.

charleshan, over 1 year ago

This is awesome! Push to Kindle (HTML to EPUB) isn't converting the page properly but I'm sure it's coming soon

johnsillings, over 1 year ago

https://www.arxiv-vanity.com/

endergen, over 1 year ago

I was hoping this meant that HTML-native submissions would be possible, so that people could make interactive explanations.

carlosjobim, over 1 year ago

With the 2024 browser update, this means I can read these articles on my ancient Kindle perfectly fine.

SallyThinks, over 1 year ago

Saw it last night! I was sooo happy! Reading papers on a phone is a nightmare. Well done guys!

alexmolas, over 1 year ago

This makes downloading and parsing paper data easy, which is pretty handy in the LLM era.

gms7777, over 1 year ago

About time. Biorxiv and medrxiv have been doing this for probably half a decade at this point?

apstats, over 1 year ago

I wonder if this could be used to train an LLM to convert PDFs with rich charts into HTML?

alecsm, over 1 year ago

I don't read many papers but this makes it easier for me to save them in Joplin.

ZeroCool2u, over 1 year ago

Wow, this is _so_ much better!

101008, over 1 year ago

Is there an open source tool to convert any PDF to something like this?

acjohnson55, over 1 year ago

This is great! I browse papers on mobile, and PDF is so bad for that use case.

nojvek, over 1 year ago

OMG. This is amazing. I legit hated reading two column pdfs on a smartphone.

hollerith, over 1 year ago

I'm sad that the best they can do is HTML format. HTML is a mess.

lucidrains, over 1 year ago

nice! will make reading papers on the phone so much more pleasant!

ww520, over 1 year ago

That's great. Now I can read the papers on my phone.

alephnerd, over 1 year ago

This is a great UX addition. Why did it take them so long?

HeavyStorm, over 1 year ago

Thank God. Maybe we can now adapt those for mobile?

llamaInSouth, over 1 year ago

Nice.... a website that offers even more web pages.

matrix2596, over 1 year ago

That's great news. I was using arxiv-vanity to read on mobile phones. I am not seeing it on all articles; is it only for new papers?

quickthrower2, over 1 year ago

Reading papers on mobile now considered sane!

wildpeaks, over 1 year ago

Very good decision, always bet on the web.

radicalriddler, over 1 year ago

FUCK YES (excuse my profanity). I have a tool that converts HTML to Neural Speech and I always wanted to push arXiv papers through it, but couldn't be bothered with a PDF implementation.

eviks, over 1 year ago

Finally a modern format you can copy&paste from and read on one of the most popular computing platforms!!!

imranq, over 1 year ago

At this point are academic papers simply peer-reviewed blog posts?

matt1, over 1 year ago

For anyone interested in staying informed about important new AI/ML papers on arXiv, check out https://www.emergentmind.com, a site I'm building that should help.

Emergent Mind works by checking social media for arXiv paper mentions (HackerNews, Reddit, X, YouTube, and GitHub), then ranks the papers based on how much social media activity there has been and how long since the paper was published (similar to how HN and Reddit work, except using social media activity, not upvotes, for the ranking). Then, for each paper, it summarizes it using GPT-4, links to the social media discussions, paper references, and related papers.

It's a fairly new site and I haven't shared it much yet. Would love any feedback or requests you all have for improving it.
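
To make the "activity versus age" idea concrete, here is a small sketch of a time-decayed ranking. It is not Emergent Mind's actual formula, which the comment does not give. It borrows the commonly cited Hacker News style shape, score = activity / (age_hours + 2)^gravity, and the `Paper` fields, the `gravity` default, and the function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    mentions: int      # hypothetical count of social media mentions
    age_hours: float   # hours since the paper was published


def score(paper: Paper, gravity: float = 1.8) -> float:
    """Time-decayed score: more mentions raise it, age lowers it.

    The +2 offset and the gravity exponent follow the commonly cited
    HN-style formula; both are assumptions here.
    """
    return paper.mentions / (paper.age_hours + 2.0) ** gravity


def rank(papers: list[Paper]) -> list[Paper]:
    """Return the papers ordered from highest to lowest score."""
    return sorted(papers, key=score, reverse=True)
```

With these defaults, a two-hour-old paper with 30 mentions outranks a two-day-old paper with 100 mentions, which is the kind of recency bias the comment describes.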

winwang, over 1 year ago

Probably more accessible in general. (PDF) Papers are psychologically scary.

creatonez, over 1 year ago

I am glad to see a sans font being used, rather than trying to replicate the serif font from the original papers. It's a bit narrow and fuzzy on low resolutions, but a massive improvement just by switching to sans.

vegabook, over 1 year ago

PDF is objectively much better than HTML at rendering text documents. And it's not even close. This could easily have been done 10, even 15-20 years ago. That it wasn't is not just inertia. LaTeX and PDF have enormously better text rendering, and the static format locks a state-commit in time that is much easier to go back to and reference/critique, unlike the intrinsically fluid nature of HTML. For academic work, milestone-like formats that lock state in time are useful for those who later build on them. And again, the rendering just doesn't compare, and that imparts [sub]conscious quality signals.