
You can't just assume UTF-8

195 points by calpaterson about 1 year ago

62 comments

JonChesterfield about 1 year ago
How about assume UTF-8, and if someone has some binary file they'd rather a program interpret as some other format, they turn it into UTF-8 using a standalone program first. Instead of burning this guess-what-bytes-they-might-like nonsense into all the software.

We don't go "oh, that input that's supposed to be JSON? It looks like a malformed CSV file, let's silently have a go at fixing that up for you". Or at least we shouldn't; some software probably does.
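A minimal sketch of that convert-first workflow (the file names and the windows-1252 source encoding are assumptions for illustration; a standalone tool like iconv does the same job):

    from pathlib import Path

    def normalize_to_utf8(src: str, dst: str, source_encoding: str = "windows-1252") -> None:
        # Decode once with the known legacy encoding, re-encode as UTF-8,
        # so every downstream program can assume UTF-8 without guessing.
        text = Path(src).read_text(encoding=source_encoding)
        Path(dst).write_text(text, encoding="utf-8")

    normalize_to_utf8("legacy.csv", "legacy.utf8.csv")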
kstrauser about 1 year ago
If you give me a computer timestamp without a timezone, I can and will assume it's in UTC. It might not be, but if it's not and I process it as though it is, and the sender doesn't like the results, that's on them. I'm willing to spend approximately zero effort trying to guess what nonstandard thing they're trying to send me unless they're paying me or my company a whole lot of money, in which case I'll convert it to UTC upon import and continue on from there.

Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special-case code paths for non-UTF-8.

If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for *not* using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world, because they're definitely not going to adapt the world to you.
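The normalize-on-import policy for timestamps, sketched (treating a zone-less ISO 8601 string as UTC is the commenter's working assumption, not a standard):

    from datetime import datetime, timezone

    def parse_timestamp_utc(raw: str) -> datetime:
        # No timezone supplied? Assume UTC and carry on.
        dt = datetime.fromisoformat(raw)
        return dt if dt.tzinfo is not None else dt.replace(tzinfo=timezone.utc)

    parse_timestamp_utc("2024-04-29T06:19:25")  # -> 2024-04-29 06:19:25+00:00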
mikhailfranco about 1 year ago
Developers *should* assume UTF-8 for text files going forward.

UTF-8 should have no BOM. It is the default. And there are no undefined Byte sequences that need an Order. Requiring a UTF-8 BOM just destroys the happy planned property that ASCII-is-UTF8. Why spoil that good work?

Other variants of Unicode have BOMs, e.g. UTF-16BE. We know CJK languages need UTF-16 for compression. The BOM is only a couple more bytes. No problem, so far so good.

But there are old files that are in 'platform encoding'. Fine, let there be an OS 'locale' that has a default encoding. That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application that is only ever used with one language encoding. Then individual files can override all of the above ...

Text file (serialization) formats should define handling of an optional BOM followed by an ASCII header that defines the encoding of the body that follows. One can also imagine .3 filetypes that have a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of *criminal liability.*

But in the absence of all of the above, the default-default-default-default-default is UTF-8.

We are talking about the future, not the past. Design for UTF-8 default, and BOMs for other Unicodes. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!

When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.
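The override chain described above, as a sketch (the ENCODING environment variable and the argument plumbing are hypothetical; the point is only the precedence order and the UTF-8 floor):

    import locale
    import os

    def resolve_encoding(file_declared: str | None = None,
                         app_arg: str | None = None) -> str:
        # A file-level declaration beats the application argument, which beats
        # a hypothetical OS 'encoding' variable, which beats the locale
        # default; UTF-8 is the default-default-default at the bottom.
        return (file_declared
                or app_arg
                or os.environ.get("ENCODING")
                or locale.getpreferredencoding(False)
                or "utf-8")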
kazinator about 1 year ago
Indeed, you can't assume UTF-8.

What you do, rather, is drop support for non-UTF-8.

Work with tech stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.

Let the customers who cling to data in weird encodings go to someone who makes a niche of supporting that.
djha-skin about 1 year ago
Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode [1]:

> What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, *based on the frequency in which various bytes appear in typical text in typical encodings of various languages*, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It's truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, *until one day, they write something that doesn't exactly conform to the letter-frequency-distribution of their native language*, and Internet Explorer decides it's Korean and displays it thusly, proving, I think, the point that Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle.

1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
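That byte-frequency guessing lives on in detection libraries; a sketch with the chardet package (the file name is illustrative, and the result is exactly what it says — a guess with a confidence score):

    import chardet  # pip install chardet

    raw = open("mystery.txt", "rb").read()
    guess = chardet.detect(raw)
    # e.g. {'encoding': 'windows-1251', 'confidence': 0.84, 'language': 'Russian'}
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")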
bhaney about 1 year ago
I'm just gonna assume UTF-8
hnick about 1 year ago
Based on my past role, you can't even assume UTF-8 when the file says it's UTF-8.

Clients would constantly send CSV or other files with an explicit BOM or other marking indicating UTF-8, but the parser would choke since they just output native Windows-1252 or similar into it. I think some programs just spit it out since it's standard.
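A defensive sketch for exactly that situation — a UTF-8 BOM stapled onto Windows-1252 bytes (the fallback encoding is an assumption about these particular clients, not a general rule):

    def decode_claimed_utf8(raw: bytes) -> str:
        if raw.startswith(b"\xef\xbb\xbf"):
            raw = raw[3:]  # strip the BOM; it is a claim, not proof
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("windows-1252")  # what these clients actually sent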
groestl about 1 year ago
I will assume it, I will enforce it where I can, and I will fight tooth and nail should push come to shove.

I got 99 problems, but charsets ain't one of them.
zadokshi about 1 year ago
Better to assume UTF-8 and fail with a clear message/warning. Sure, you can offer to guess to help the end user if it fails, but as other people have pointed out, it's been standard for a long time now. Even Python caved and accepted it as the default: https://peps.python.org/pep-0686/
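A sketch of that fail-with-a-clear-message approach (PEP 686 makes UTF-8 the default open() encoding; the wording of the error below is invented for illustration):

    def read_utf8_strict(path: str) -> str:
        try:
            with open(path, encoding="utf-8") as f:
                return f.read()
        except UnicodeDecodeError as e:
            raise SystemExit(
                f"{path}: not valid UTF-8 at byte {e.start} "
                f"({e.object[e.start:e.start + 4]!r}); convert to UTF-8 and retry"
            )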
Veserv about 1 year ago
Off-topic, but the bit numbering convention is deliciously confusing.

Little-endian bytes (lowest byte is leftmost) and big-endian bits (bits contributing less numerical value are rightmost) are normal, but the bits are referenced/numbered little-endian (first bit is leftmost even though it contributes the most numerical value). When I first read the numbering convention I thought it was going to be a breath of fresh air of someone using the much more sane, but non-standard, little-endian bits with little-endian bytes, but it was actually another layered twist. Hopefully someday English can write numbers little-endian, which is objectively superior, and do away with this whole mess.
o11c about 1 year ago
Default UTF-8 is better than the linked suggestion of using a heuristic, but failing catastrophically when old data is encountered is unacceptable. There *must* be a fallback.

(Note that the heuristic for "is this intended to be UTF-8" is pretty reliable, but most other encoding-detection heuristics are very bad quality.)
lifthrasiir about 1 year ago
You can't just assume UTF-8, but you can *verify* that it is almost surely encoded in UTF-8, unlike other legacy encodings. Which makes UTF-8 the first and foremost consideration.
norir about 1 year ago
If it's turtles all the way down and at every level you use UTF-8, it's hard to see how any input with a different encoding (for the same underlying text) would not be detected before any unintended side effects are invoked.

At this point, I don't see any sufficiently good reason not to use UTF-8 exclusively in any new system. Conversions to and from other encodings would only be done at well-defined boundaries, when I'm calling into dependencies that require non-UTF-8 input for whatever reason.
bandyaboot about 1 year ago
> In the most popular character encoding, UTF-8, character number 65 ("A") is written:

> 01000001

> Only the second and final bits are 1, or "on".

Isn't it more accurate to say that the first and penultimate bits are 1, or "on"?
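A quick check of the quoted claim (reading left to right, most significant bit first, as the commenter does):

    assert ord("A") == 65 == 0b01000001
    assert format(ord("A"), "08b") == "01000001"
    # Left to right, the 1s sit at the first and penultimate positions;
    # the article's "second and final" only works counting from the right.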
vitaut about 1 year ago
This is so spectacularly outdated. KOI-8 has been dead for ages.
vkaku about 1 year ago
The probability of web content not being in UTF-8 is getting lower and lower.

Last I tracked, as of this month, 0.3% of surveyed web pages used Shift JIS. It has been declining steadily. I really hope people move to UTF-8. While it is important to understand how the code pages and encodings helped, I think it's a good time to actually start moving a lot of applications to use UTF-8. I am perfectly okay if people want to use UTF-16 (the OG Unicode) and its extensions alternatively, especially for Asian applications.

Yes, historic data preservation requires a different strategy than designing stuff for the future. It is okay, however, to migrate to these encodings and keep giving old data and software new life.
mihaaly about 1 year ago
Excellent article: good content, good length, enlightened subtexts and references, a joy to read.
lolc about 1 year ago
Just the most recent episode: a statistician is using PHP, on Windows, to analyze text for character frequency. He's rather confused by the UTF-16LE encoding and thinks the character "A" is numbered 4100, because that's what is shown in a hex editor. I tried explaining about the little-endian part, and mb-string functions in PHP. And that PHP is not a good fit for his projects.

Then I realized that this is hilarious and I won't be able to kick him out of his local minimum there. Everything he could learn about encodings would first complicate his work.
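What the statistician saw, reproduced (the hex editor shows the UTF-16LE code unit's bytes in memory order, low byte first):

    "A".encode("utf-16-le").hex()                      # '4100' — what the hex editor shows
    int.from_bytes("A".encode("utf-16-le"), "little")  # 65 — the actual code point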
flohofwoe about 1 year ago
The post seems to assume that only UTF-16 has byte order marks, but as pointless as it sounds, UTF-8 has a BOM too (EF BB BF). It seems to be a Windows thing though; I haven't seen it in the wild anywhere else (and also rarely on Windows, since text editors typically allow saving UTF-8 files with or without a BOM. I guess it depends on the text editor which of those is the default).
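Python even ships a dedicated codec for that Windows-flavored BOM; a small demonstration:

    raw = b"\xef\xbb\xbfhello"     # UTF-8 BOM (EF BB BF) plus ASCII text
    raw.decode("utf-8-sig")        # 'hello' — the BOM is stripped
    raw.decode("utf-8")            # '\ufeffhello' — the BOM leaks through as U+FEFF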
rob74 about 1 year ago
30 years ago: "you can't just assume ASCII"

Today: "you can't just assume UTF-8"

The more things change, the more they stay the same...
pronoiac about 1 year ago
Archive copy: https://web.archive.org/web/20240429061925/https://csvbase.com/blog/9
drdaeman about 1 year ago
Anyone got EBCDIC on their bingo cards? Because if the argument is "legacy encodings are still relevant in 2024" then we also need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more perverted fun) into the picture. Makes heuristics extra fun.

Or, you know, just say "nah, I can; that ancient stuff doesn't matter (outside of obligatory exceptions, like software archeology) anymore." If someone wants to feed me a KOI8-R or JIS X 0201 CSV heirloom, they should convert it into something modern first.
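EBCDIC is at least easy to handle once you know that's what you have — Python ships EBCDIC code pages such as cp500 (the sample string is illustrative):

    "HELLO".encode("cp500")                   # b'\xc8\xc5\xd3\xd3\xd6' — not even ASCII-compatible
    b"\xc8\xc5\xd3\xd3\xd6".decode("cp500")   # 'HELLO'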
iamcreasy about 1 year ago
By heuristics, is the author referring to the rules and policies published by Unicode? [1]

Link [1] was referred to as a solution to this problem in the article "The Absolute Minimum Every Software Developer Must Know About Unicode in 2023" [2].

[1] https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

[2] https://tonsky.me/blog/unicode/
mgaunard about 1 year ago
There is a pretty successful world language standard: English.
jujube3 about 1 year ago
Actually, I can just assume UTF-8, since that's what the world standardized on. Just like I can assume the length of a meter or the weight of a gram. There is no need to have dozens of incompatible systems.
mschuster91 about 1 year ago
> CSV files, in particular, have no way to signal, in-band, which encoding is used.

That's actually wrong. Add a UTF-8 BOM; that's enough for Excel (and some other libraries) to know what is going on [1].

[1] https://csv.thephpleague.com/8.0/bom/
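In Python that trick is one keyword away — the utf-8-sig codec writes the BOM for you (the file name and columns are made up):

    import csv

    with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "city"])      # Excel sees the BOM and decodes as UTF-8
        writer.writerow(["Åsa", "Göteborg"])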
planede about 1 year ago
A bag of bytes is a bag of bytes. Any encoding should be either assumed by the protocol being used or otherwise specified.
jcranmer about 1 year ago
I haven't seen discussion of this point yet, but the post completely fails to provide any data to back up its assertion that charset-detection heuristics work, because the feedback I've seen from people who actually work with charsets is that detection largely *doesn't* work (especially if it's based on naive one-byte frequency analysis). Okay, sure, it works if you want to distinguish between KOI8-R and Windows-1252, but what about Windows-1252 and Windows-1257?

See for example this effort in building a universal charset detector in Gecko: https://bugzilla.mozilla.org/show_bug.cgi?id=1551276
Karellen about 1 year ago
Don't worry, I never assume UTF-8.

I *require* UTF-8. If it isn't currently UTF-8, it's someone else's problem to transform it to UTF-8 first. If they haven't, and I get non-UTF-8 input, I'm fine bailing on that with a "malformed input - please correct" error.
chungy about 1 year ago
The article's pretty weird for presenting little-endian UTF-16 as normal and barely even mentioning that big-endian is an option (in fact, it seems to refer to it as "backwards"), even though big-endian is a much more human-readable format.
pylua about 1 year ago
Stupid question: how are the headers passed for HTTP? What encoding describes the encoding?
hot_gril about 1 year ago
Java and JavaScript both use UTF-16 for internal string representation, even though JSON specifies UTF-8. Windows APIs too. I'm still not sure why, but it means that one char uses at least 2 bytes even if it's in the ASCII range.
dandigangi about 1 year ago
Except I can
GnarfGnarf about 1 year ago
Why would anyone use anything other than UTF-8 in this day and age?

Windows took a gamble years ago when the winner was not obvious, so we're stuck with UCS-2, but you can circumvent that with a good string library like Qt's QString.
otikik about 1 year ago
However, you can check for invalid UTF-8 sequences, throw an error with "invalid encoding on byte x, please use valid utf-8" if one is encountered, and from that point on assume UTF-8.
Dwedit about 1 year ago
But you can assume non-UTF-8 upon seeing an invalid UTF-8 byte sequence. From there, it can be application-specific, depending on what encoding you expect to see.

(And of course UTF-16 if there's a BOM.)
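A sketch of that policy — BOM first, then validate UTF-8, then fall back to whatever the application expects (the windows-1252 fallback is a stand-in):

    import codecs

    def guess_encoding(raw: bytes) -> str:
        for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                         (codecs.BOM_UTF16_LE, "utf-16-le"),
                         (codecs.BOM_UTF16_BE, "utf-16-be")):
            if raw.startswith(bom):
                return enc
        try:
            raw.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "windows-1252"  # application-specific expectation goes here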
camgunz about 1 year ago
Dear lazyweb: I think I read something about Postel's Law being essential to the internet's success -- maybe this was also IPv6 related? Does anyone else remember this?
teliskr about 1 year ago
I maintain a system I created in 2004 (crazy, right?). Not sure how we lived, but at the time emojis were not as much of a thing. This has come back to bite me several times.
ahi about 1 year ago
Why UTF-8 when we have MARC-8? https://en.wikipedia.org/wiki/MARC-8
tanin about 1 year ago
I'm actually having this issue where users import CSV files that don't seem to be valid. DuckDB throws out errors like: "Invalid Closing Quote: found non trimable byte after quote at line 34", "Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in value construction", "Value with unterminated quote found".

One example: pgAdmin can export a database table into a CSV... but the CSV isn't valid for DuckDB to consume, because, for some odd reason, pgAdmin uses a single quote to escape a double quote.

This blog is pretty timely. Thank you for writing it!
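If the escape character really is a single quote, DuckDB's read_csv lets you say so explicitly instead of failing; a sketch (the file name is made up, and whether this matches pgAdmin's exact output is an assumption):

    import duckdb

    rows = duckdb.sql(
        "SELECT * FROM read_csv('pgadmin_export.csv', quote='\"', escape='''')"
    ).fetchall()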
AzzyHN about 1 year ago
I'm not a programmer, but how hard is it to write something that just checks what encoding a file uses? Or at least does its best to guess?
Havoc about 1 year ago
You underestimate my willingness to happy-path code...
missblit about 1 year ago
Encodings I have used or been exposed to in my career: ASCII, Latin1, Windows-1252, UTF-8, UTF-16, UTF-32, GBK, Zawgyi, Shift-JIS
kdklol about 1 year ago
> You can't just assume UTF-8

But I will, because in this day and age, I should be perfectly able to do so. Non-use of UTF-8 should simply be considered a bug, and not treating text as UTF-8 should frankly be a you problem. At least for anything reasonably modern, by which I mean made in the last 15 years at least.
dublin about 1 year ago
Make your life easy. Assume 7-bit ASCII. No one needs all those other characters, anyway...
rch about 1 year ago
Don't assume; force UTF-8.
teknopaul about 1 year ago
I suspect the author does not know about the first-bit-being-1 thing.

UTF-8 is magic.

You can assume US-ASCII for lots of very useful text protocols, like HTTP and STOMP, and not care what the variable string bytes mean.

Soooo many software architects don't grok the magic of it.

You can define an 8-bit parser that checks for "a:"(some string)\n

and it works with a shitload of human languages.

The company I work for does not realise that most of the 50-year-old legacy C it has is fine with UTF-8 for all the arbitrary fixed-length or \0-terminated strings it stores.
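The property being invoked: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII-delimiter parser never splits inside a character. A quick sketch (the header format is the comment's own example):

    def parse_header(line: bytes) -> tuple[bytes, bytes]:
        # Split on ASCII bytes only; UTF-8 lead/continuation bytes are all
        # >= 0x80, so b":" and b"\n" can never occur inside a character.
        key, _, rest = line.partition(b":")
        return key, rest.rstrip(b"\n")

    parse_header("name:Łukasz\n".encode("utf-8"))  # (b'name', b'\xc5\x81ukasz')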
neonsunset about 1 year ago
I *will* assume UTF-8, and if it's not, it will be your fault :)
smeagull about 1 year ago
I absolutely can. If it's not UTF-8, I assume it's worthless.
hchak about 1 year ago
Hoping that LLMs can solve our "Tower of Babel" problem... :)
stalfosknight about 1 year ago
You can on all of Apple's platforms.
teddyh about 1 year ago
If you're actually in a position where you *need* to *guess* the encoding, something like ftfy <https://github.com/rspeer/python-ftfy> (webapp: <https://ftfy.vercel.app/>) is a perfectly reasonable choice.

But you should always do your absolute utmost *not* to be put in a situation where guessing is your only choice.
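ftfy's specialty is repairing mojibake after a wrong guess has already mangled the text; a sketch using its documented fix_text entry point:

    import ftfy  # pip install ftfy

    ftfy.fix_text("The Mona Lisa doesnâ€™t have eyebrows.")
    # "The Mona Lisa doesn't have eyebrows."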
eqvinox about 1 year ago
The solution, obviously, is to train an LLM to recognize the character set.
koito17 about 1 year ago
The comments in this thread are a bit amusing.

I wish I could live in the world where I could bluntly say "I will assume UTF-8 and ignore the rest of the world". Many Japanese documents and sites still use Shift JIS. Windows has this strange Windows-932 format that you will frequently encounter in CUE files output by some CD-ripping software. ARIB STD-B24, the captioning standard used in Japanese television, has its own text encoding with characters not found in either JIS X 0201 or JIS X 0208. These special characters are mostly icons used in traffic and weather reports, but transcoding to UTF-8 still causes trouble with these icons.
jheriko about 1 year ago
No mention of the BOM... tragic.
mseepgood about 1 year ago
Yes, I can.
timoteostewart about 1 year ago
Fascinating topic. There are two ways the user/client/browser receives reports about the character encoding of content. And there are hefty caveats about how reliable those reports are.

(1) First, the web server usually reports a character encoding, a.k.a. charset, in the HTTP headers that come with the content. Of course, the HTTP headers are not part of the HTML document but are rather part of the overhead of what the web server sends to the user/client/browser. (The HTTP headers and the `head` element of an HTML document are entirely different.) One of these HTTP headers is called Content-Type, and conventionally this header often reports a character encoding, e.g., "Content-Type: text/html; charset=UTF-8". So this is one place a character encoding is reported.

If the actual content is *not* an (X)HTML file, the HTTP header might be the only report the user/client/browser receives about the character encoding. Consider accessing a plain text file via HTTP. The text file isn't likely to itself contain information about what character encoding it uses. The HTTP header of "Content-Type: text/plain; charset=UTF-8" might be the only character encoding information that is reported.

(2) Now, if the content is an (X)HTML page, a charset encoding is often also reported in the content itself, generally in the HTML document's head section in a meta tag such as '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>' or '<meta charset="utf-8">'. Now, just because an HTML document self-reports that it uses a UTF-8 (or whatever) character encoding, that's hardly a guarantee that the document does in fact use said character encoding.

Consider the case of a program that generates web pages using a boilerplate template still using an ancient default of ISO-8859-1 in the meta charset tag of its head element, even though the body content that goes into the template is being pulled from a database that spits out a default of UTF-8. Boom. Mismatch. Janky code is spitting out mismatched and inaccurate character encoding information every day.

Or consider web servers. Consider a web server whose config file contains the typo "uft-8" because somebody fat-fingered while updating the config (I've seen this in random web pages). Or consider a web server that uses a global default of "utf-8" in its outgoing HTTP headers even when the content being served is a hodge-podge of UTF-8, WINDOWS-1251, WINDOWS-1252, and ISO-8859-1. This too happens all the time.

I think the most important takeaway is that with both HTTP headers and meta tags, there's no intrinsic link between the character encoding being reported and the *actual* character encoding of the content. What a web server tells me and what's in the meta tag in the markup just count as two reports. They might be accurate, they might not be. If it really matters to me what the character encoding is, there's nothing for it but to determine the character encoding myself.

I have a Hacker News reader, https://www.thnr.net, and my program downloads the URL for every HN story with an outgoing link. I have seen binary files sent with a "UTF-8" Content-Type header. I have seen UTF-8 files sent with an "inode/x-empty" Content-Type header. My logs have literally hundreds of goofy, inaccurate reports of content types and character encodings. Because I'm fastidious and I want to know what a file actually is, I have a function `get_textual_mimetype` that analyzes the content of what the URL's web server sends me. My program downloads the content and uses tools such as `iconv` and `isutf8` to get some information about what encoding it might be. It uses `xmlwf` to check if it's well-formed XML. It uses `jq` to check whether it's valid JSON. It uses `libmagic`. There's a lot of fun stuff the program does to pin down with a high degree of certainty what the content is. I want my program to know whether the content is an application/pdf, an image/webp, a text/html, an application/xhtml+xml, a text/x-csrc, or whatever. Only a rigorous analysis will tell you the truth. (If anyone is curious, the source for `get_textual_mimetype` is in the repo for my HN reader project: https://github.com/timoteostewart/timbos-hn-reader/blob/main/utils_mimetypes_magic.py)
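A sketch of collecting the two (often contradictory) reports described above — the Content-Type header and the meta tag — before deciding whom to believe; this is illustrative plumbing, not the author's get_textual_mimetype:

    import re
    import urllib.request

    def reported_charsets(url: str) -> tuple[str | None, str | None]:
        with urllib.request.urlopen(url) as resp:
            header = resp.headers.get_content_charset()  # claim #1: HTTP header
            head = resp.read(4096)                       # meta tags live early
        m = re.search(rb"""<meta[^>]+charset=["']?([\w-]+)""", head, re.I)
        meta = m.group(1).decode("ascii", "replace") if m else None
        return header, meta  # two claims; neither is ground truth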
matheusmoreira about 1 year ago
Yeah, but I'm gonna do it anyway. If it's not UTF-8, it's terrible and broken and not worth supporting unless some serious cash is on the table.
klysm about 1 year ago
Meh, I'd rather start making those assumptions and treat everything that doesn't use UTF-8 as broken.
AlienRobot about 1 year ago
"You can't assume a 32-bit integer starts from 0"
jheriko about 1 year ago
Fuck me, it's 2024.

2014 me is shaking his head in ways 2004 me saw coming in 1994.

I'm only 40.

FUCK
TypicalHog about 1 year ago
IMO humanity should just do ASCII and call it a day. I'm not talking about the viability of this, just my wishes.