HTML whitespace is broken (2024)

131 points by thm 3 months ago

23 comments

austin-cheney 3 months ago

* All white space in HTML and XML is preserved verbatim.
* HTML has a default presentation scheme that varies by interpreter. For everything else use CSS.
* The default presentation of white space in HTML and XML is what is called tokenized space, which is that all consecutive white space characters are displayed as a single space character. Again, you can control this with CSS.
* White space does not determine the behavior or display of other HTML tags.
* White space is a text node in the DOM. If it is adjacent to other text, that text and white space are one text node; otherwise the white space is its own text node.

That should be all there is to it. JavaScript has absolutely no bearing on this subject.
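
The preserve-then-collapse model described in this comment can be sketched in a few lines. This is only an illustration under stated assumptions: Python's `xml.dom.minidom` stands in for a browser DOM, `SOURCE` is made-up markup, and `collapse` is a deliberately simplified model of the default presentation, not what any rendering engine actually runs.

```python
import re
from xml.dom import minidom

# Hypothetical markup, indented the way hand-edited HTML usually is.
SOURCE = "<p>\n  <b>Hello</b>\n  <i>world</i>\n</p>"


def text_content(node) -> str:
    """Concatenate the data of every text node below `node`, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_content(child) for child in node.childNodes)


def collapse(text: str) -> str:
    """Simplified default presentation: each run of white space becomes one space."""
    return re.sub(r"[ \t\r\n\f]+", " ", text)


paragraph = minidom.parseString(SOURCE).documentElement

# The parser preserves the white space verbatim, as text nodes between the elements.
children = [
    node.data if node.nodeType == node.TEXT_NODE else node.tagName
    for node in paragraph.childNodes
]
print(children)

# Only the presentation step collapses it.
print(repr(text_content(paragraph)))
print(repr(collapse(text_content(paragraph))))
```

The sketch leaves a space at each end of the collapsed string; real layout also discards collapsible spaces at the start and end of a line, which this model does not attempt.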

ximm 3 months ago

I expected an uninformed rant and actually got a really well informed, balanced rant. I don't agree that this is a major issue, but every time I thought of a counter argument it was immediately addressed in the article. Well done!

calibas 3 months ago

The part about <aside> is worded in a way that can be confusing. It implies that a <div> is always going to be displayed as a block, but an <aside> is some kind of special element that could be either a block or inline.

Whether it's "inline", "block", or "inline-block" is determined by the CSS for every visual HTML element. A <div> is not a block, nor an <a> an inline; those are just the defaults and can be easily overridden.

vintagedave 3 months ago

On whitespace: what about sentences? HTML has paragraph-level, and word-level, spacing. (This article is about word-level.)

There's no concept of sentences as a semantic element. I need to write a thoughtful rant on this sometime. You can read a mini-thesis here: https://daveon.design/about-dave-on-design.html#typography-&-layout:-spacing

eqvinox 3 months ago

The author seems entirely unaware that HTML (SGML/XML) entities are essentially text replacement. &#32; is the same as a literal space, and breaking this (or adding a new &lf; that isn't a literal line break) would create an even worse mess than these purported whitespace problems.

(Also &ZeroWidthSpace; should probably have been discussed in this article.)
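
The "text replacement" point can be checked with Python's standard `html.unescape`, which resolves character references the way HTML5 defines them. A small sketch, with made-up sample strings:

```python
import html

# A numeric character reference is just another spelling of the code point.
literal = "a b"
numeric = html.unescape("a&#32;b")
print(numeric == literal)

# Named references work the same way; they simply name other code points.
nbsp = html.unescape("&nbsp;")
zwsp = html.unescape("&ZeroWidthSpace;")
print(f"U+{ord(nbsp):04X}", f"U+{ord(zwsp):04X}")
```

After decoding, nothing records that a space was written as `&#32;`, which is why giving it different collapsing behaviour would need something new in the document model rather than a new reference.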

radium3d 3 months ago

I do the “you’re probably not going to do” mentioned all the time. Maybe that’s why I haven’t had many instances of needing to adjust for white space in html over the last almost 30 years of web development haha

peter-m80 3 months ago

Broken = works as expected

jfk13 3 months ago

Does the article's "Example 31" look correct to anyone? I see no "small space between the two boxes" (as claimed) in any of the browsers I tried.

65 3 months ago

One thing I've always found incredibly annoying is the &nbsp; character is not the same width as a space. It's slightly thinner.

wesammikhail 3 months ago

The number of hours I've wasted over the years trying to resolve whitespace problems is... not okay, to say the least.

Great writeup though!

malaise 3 months ago

White space being broken was one of my first gripes with HTML, and that was about 25 years ago. It’s amazing to see a whole post about it finally, but also so obvious that I’m surprised to see one.

worksonmine 3 months ago

This is one of the reasons I need at least a minify step on the HTML of any site I build. Sometimes I'd prefer not to but it's the easiest solution to have both indented HTML in the source and consistent spacing in the result.

I don't think it's broken though; imagine if every whitespace was rendered. And how do we know what should be collapsed? I don't see a better solution that satisfies every situation.

Use <pre> to preserve the whitespace in code, for everything else use CSS.

ximm 3 months ago

Concerning text-to-speech and the missing separation of HTML and CSS: There are several open issues about this in the spec that defines how accessible names and descriptions are computed from HTML elements: https://github.com/w3c/accname/issues?q=state%3Aopen%20label%3A%22topic-whitespace%22

vintagedave 3 months ago

This bothers me too:

> Newlines and tabs are also treated identically and collapsed into spaces.

This means that simply formatting your HTML -- as you might when hand-editing -- can add spaces. Possibly, something else is going on too: I'm not an expert here, I've just observed rendering differences when you have a newline between tags. Maybe it's inline vs block tags as the author also discusses?

This has bitten me a few times, and my hand-written or SSG-generated sites that I want to look nice have several ugly, long lines that should be broken up to be readable, but are not.

ximm 3 months ago

The one thing I thought was IMHO missing from this article was JavaScript.

In HTML, it is pretty natural to add white space (i.e. text nodes containing white space) between all elements. You basically only have to worry if you want to avoid that.

In JavaScript, the opposite is true. If you want to create a text node, you have to do so explicitly. If you just create elements and append them to the same parent, they will be added without whitespace.

I am not sure how JSX behaves in this regard. Last time I checked it was more like JavaScript than HTML, which was of course very confusing for people.
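
The contrast between parsed markup and programmatically built trees can be sketched outside a browser. The example below assumes Python's `xml.dom.minidom` as a stand-in for the DOM API (its `createElement`, `createTextNode` and `appendChild` mirror the JavaScript methods of the same names); the markup and the `child_summary` helper are made up for illustration.

```python
from xml.dom import minidom


def child_summary(parent):
    """List the children of `parent`: tag names for elements, raw data for text nodes."""
    return [
        node.data if node.nodeType == node.TEXT_NODE else node.tagName
        for node in parent.childNodes
    ]


# Parsed from indented markup: white-space text nodes appear between the elements.
parsed = minidom.parseString(
    "<p>\n  <span>a</span>\n  <span>b</span>\n</p>"
).documentElement

# Built through the API: no text node exists unless one is created explicitly.
doc = minidom.Document()
built = doc.createElement("p")
for label in ("a", "b"):
    span = doc.createElement("span")
    span.appendChild(doc.createTextNode(label))
    built.appendChild(span)

print(child_summary(parsed))
print(child_summary(built))
print(built.toxml())
```

With inline elements such as these spans, the parsed version would show a space between "a" and "b" when rendered and the built version would not, unless a text node containing a space is appended between them.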

P-Nuts 3 months ago

It has always slightly nagged me that web browsers don’t do as good a job as TeX at line breaking and sentence spacing.

nilslindemann 3 months ago

It should remove all whitespace before and after line breaks ("\n"), including the line break itself. And otherwise replace inline whitespace with one space. In pre's keep the whitespace. Well, at least we have display:flex.
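
The rule proposed in this comment is concrete enough to sketch. The function below is one possible reading of it, not an existing browser behaviour: `proposed_collapse` is a hypothetical name, only spaces, tabs and "\n" are treated as whitespace, and <pre> handling is left out.

```python
import re


def proposed_collapse(text: str) -> str:
    """Apply the two-step rule from the comment to a run of source text."""
    # Step 1: remove every line break together with the whitespace around it.
    text = re.sub(r"[ \t]*\n[ \t\n]*", "", text)
    # Step 2: replace any remaining run of inline whitespace with a single space.
    return re.sub(r"[ \t]+", " ", text)


# Indentation between tags no longer produces a space.
markup = proposed_collapse("<b>Hello</b>\n    <i>world</i>")
# Runs of inline whitespace still collapse to one space.
inline = proposed_collapse("one   two\tthree")
# Prose wrapped across source lines loses the space at the wrap point.
prose = proposed_collapse("first line\n  second line")

print(markup)
print(inline)
print(prose)
```

The third case shows the trade-off: under this rule, authors who hard-wrap prose in the source would have to keep an explicit space at the break some other way.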

thro1 3 months ago

Working XMP and LISTING Tag Element Examples: http://www.the-pope.com/listin.html

LLcolD 3 months ago

The issue probably comes from the fact that web browsers try to render HTML even if it’s not perfect. HTML isn’t super strict, so browsers will still display pages with small mistakes. There was a push to make HTML stricter with XHTML, which enforced rules like case-sensitive elements and closing tags, kind of like XML. But it didn’t really stick. Browsers had a hard time with those stricter rules, so HTML’s more relaxed approach stuck around. For some time I really tried to use XHTML when creating web pages, but then I asked myself why go to all of the trouble when browsers don't follow the standards.

TheChaplain 3 months ago

It's hard to take any article seriously with a header like "X is broken" when it obviously isn't for a ridiculously large majority.

chrismorgan 3 months ago

The bit about <pre> ignoring leading newlines is essentially a completely different thing from the rest of the article. It’s specifically HTML syntax parser behaviour <https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody>:

> If the next token is a U+000A LINE FEED (LF) character token, then ignore that token and move on to the next one. (Newlines at the start of pre blocks are ignored as an authoring convenience.)

This is also applied to <textarea>.

Personally, I think it was a mistake, because it complicates things and doesn’t do enough to justify itself. If it also did leading whitespace trimming across all lines, it’d be interesting enough to maybe justify itself as an authoring convenience (… though honestly I suspect that’d end up worse), but as it is it’s just an extra complication. I’ve needed to deal with the nuance of its special behaviour more than once or twice, and I’ve seen others stumble over it too. It’s also part of the fairly small pile of HTML features that make it not-round-trippable: it’s only done in parsing; the serialiser doesn’t insert an extra ␊ if it would emit `<pre>␊`.

This is one of the many cases that tempts me in the direction of the XML syntax (which, to many people’s surprise, is absolutely still a thing: save a local file with extension .xhtml, or serve over HTTP with MIME type application/xhtml+xml). The fact that XML doesn’t have a parser that guesses what you meant is generally a nice feature.

(XML also has whitespace collapsing, xml:space. Honestly it’s interesting in this context, conveying whitespace-handling intent, but I’ll ignore it. Because it’s never coming to HTML.)

But we’re stuck with this behaviour, because it would break compatibility.

And that’s where a lot of the rest of the article baffles me, because I get the general sense, from the way he presents information, that this guy doesn’t understand a lot of HTML’s history and philosophy, things I’d expect to be understood by a member of the Angular team. The suggestions made are generally just obviously not suitable for HTML, not just because of compatibility, but also because of philosophy.

You think &#32; should be different from SPACE? Sorry, I think we’re up to about forty years since that ship sailed; entities/character references are strictly shorthand, and numbered entities are strictly code points. And do you know how confusing it would be if it worked differently? It would be a one-off special case.

You think you can add a new entity to handle this? In XML I think you might be able to do that (way too long since I’ve written a DTD to remember clearly), but in HTML they’re called character references, because that’s all they can be, and your non-collapsing space would need to be either something entirely new in the document model, or shorthand for some other markup.

> You'd think the CMS should be able to solve this problem, but it really can't.

Uh, yes it can, and they all do, where they accept plain text, by either chunking the text into HTML paragraphs (e.g. "<p>" + s/\n\n/<\/p><p>/ + "</p>"), or by turning your text line breaks into HTML line breaks (e.g. s/\n/<br>/). CMSes do a lot of dodgy stuff like this. If you want to have nightmares, look at WordPress’s wpautop function, and think through the implications of it all. It’s a radioactive wasteland of bad ideas.

It’s also rather important to remember that two line breaks in HTML (e.g. A<br><br>B) is not the same as a paragraph break (e.g. <p>A</p><p>B</p>). Consider margins and text-indent, for a start.

> How Could we Fix This?

The offered solution, “quote your strings”, is what almost all programming languages tend to do. Document languages practically never quote their strings (I can’t immediately think of any even vaguely popular ones that do). Document languages consistently default to text mode, with only markup elements requiring special syntax.

As is later noted, there is, of course, absolutely no chance of HTML ever doing anything even vaguely like this. And honestly, if such a breaking change were on the cards, you’d be making far more invasive changes to HTML’s syntax.

> 3. HTML already breaks the rules of common text formatting.

> • The idea that you can write HTML today by just typing the text you want is a lie.

No it isn’t: no one ever suggested that was a feature; there was no dishonesty. HTML is a markup language.

⁂

The remark on template language whitespace control is incorrect:

Say hello to {%- username -%} and welcome them to the team!

You’ll actually get “Say hello toDeveland welcome them to the team!” which is clearly not what’s wanted.

⁂

For my own part, I have at times seriously considered producing HTML with only the whitespace I mean, and applying something along the lines of `:root { white-space: pre-wrap }`.

But then I remember that there’s a lot more that’s dodgy around segmentation, both in the directions of extraneous and missing breaks. For example, this URL and its rendering:

    data:text/html,<body style=font-family:monospace;width:5ch>Look at C++! X &lt;/a>

    Look at C+ +! X </ a>

Viewing on my phone (which, due to narrower column width, is more likely to demonstrate such problems), I think I’ve come across three articles on HN in the last week or so exhibiting this sort of problem. If I were writing much that referred to C++, I would genuinely make something to change it to a non-breaking form of C++, and I do sometimes tweak breaking behaviour inside <code> elements to control where breaks can occur. (I’m also the kind of guy who types actual no-break spaces in Bible references where the book has an ordinal, e.g. “1 John 2:3” will have one NBSP and one SPACE.)

And in the end… HTML collapsing whitespace has done a lot to quell the two-spaces-between-sentences convention some hold, so it’s not all bad. ;-)
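
The round-trip problem with <pre> described in this comment can be shown with a toy model. This is a sketch under loud assumptions: `parse_pre`, `serialize_pre` and `inner` are hypothetical helpers that model only the one quoted parser rule and a serialiser that adds no compensating line feed, as the comment describes; they are not a real HTML parser and ignore escaping entirely.

```python
def parse_pre(source_inner: str) -> str:
    """Return the text a parser would store, given the characters after the <pre> start tag."""
    # Quoted rule: a single LF immediately after the start tag is ignored.
    return source_inner[1:] if source_inner.startswith("\n") else source_inner


def serialize_pre(text: str) -> str:
    """Emit the stored text unchanged, without re-adding the dropped LF (escaping ignored)."""
    return f"<pre>{text}</pre>"


def inner(markup: str) -> str:
    """Strip the literal <pre> and </pre> wrappers from a serialised element."""
    return markup[len("<pre>"):-len("</pre>")]


original = "<pre>\n\nhello</pre>"
dom_text = parse_pre(inner(original))       # one of the two leading LFs is dropped
round_tripped = serialize_pre(dom_text)     # the surviving LF now sits right after <pre>
reparsed = parse_pre(inner(round_tripped))  # so the second parse drops that one too

print(repr(dom_text), repr(round_tripped), repr(reparsed))
```

Each parse-and-serialise cycle eats one more leading line feed until none are left, which is the sense in which the behaviour is not round-trippable.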

BoujidStack 3 months ago

Whitespace in HTML always seems to do its own thing, never quite what you expect!

bloak 3 months ago

U+0020 is perhaps the weirdest character in Unicode. Most of the time it's used to separate words, which is arguably a kind of mark-up. But sometimes it's used for other kinds of formatting. Also, if you were to use explicit mark-up for words I have no idea how you'd handle punctuation. Perhaps writing should be redesigned from the ground up?

But meanwhile, though we'd all love to see the plan, let's stick with the mess we're used to.
