
You can't just assume UTF-8

195 points by calpaterson, about 1 year ago

62 comments

JonChesterfield about 1 year ago
How about assume UTF-8, and if someone has some binary file they'd rather a program interpret as some other format, they turn it into UTF-8 using a standalone program first, instead of burning this guess-what-bytes-they-might-like nonsense into all the software.

We don't go "oh, that input that's supposed to be JSON? It looks like a malformed CSV file, let's silently have a go at fixing that up for you". Or at least we shouldn't; some software probably does.
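That convert-first step is essentially what iconv does; a minimal Python sketch, assuming the legacy encoding is actually known (the file names and koi8-r are placeholders):

```python
# Convert a file with a *known* legacy encoding to UTF-8, once, up front,
# instead of teaching every downstream program to guess.
with open("legacy.txt", "rb") as src:
    text = src.read().decode("koi8-r")        # the known source encoding
with open("legacy.utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)
```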
kstrauser about 1 year ago
If you give me a computer timestamp without a timezone, I can and will assume it's in UTC. It might not be, but if it's not and I process it as though it is, and the sender doesn't like the results, that's on them. I'm willing to spend approximately zero effort trying to guess what nonstandard thing they're trying to send me unless they're paying me or my company a whole lot of money, in which case I'll convert it to UTC upon import and continue on from there.

Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special-case code paths for non-UTF-8.

If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for *not* using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world, because they're definitely not going to adapt the world to you.
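A sketch of that import-boundary policy in Python (the function name is an invention for illustration):

```python
from datetime import datetime, timezone

# A naive timestamp is *assumed* UTC at the boundary; everything
# downstream then works only with timezone-aware UTC datetimes.
def normalize_to_utc(ts: str) -> datetime:
    dt = datetime.fromisoformat(ts)
    return dt.astimezone(timezone.utc) if dt.tzinfo else dt.replace(tzinfo=timezone.utc)

print(normalize_to_utc("2024-04-29T12:00:00"))        # naive: assumed UTC
print(normalize_to_utc("2024-04-29T12:00:00+02:00"))  # aware: converted to UTC
```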
mikhailfranco about 1 year ago
Developers *should* assume UTF-8 for text files going forward.

UTF-8 should have no BOM. It is the default, and there are no undefined byte sequences that need an order. Requiring a UTF-8 BOM just destroys the happy, planned property that ASCII-is-UTF-8. Why spoil that good work?

Other variants of Unicode have BOMs, e.g. UTF-16BE. We know CJK languages need UTF-16 for compression. The BOM is only a couple more bytes. No problem; so far, so good.

But there are old files that are in 'platform encoding'. Fine: let there be an OS 'locale' that has a default encoding. That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application that is only ever used with one language encoding. Then individual files can override all of the above...

Text file (serialization) formats should define handling of an optional BOM followed by an ASCII header that defines the encoding of the body that follows. One can also imagine .3 filetypes that have a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of *criminal liability*.

But in the absence of all of the above, the default-default-default-default-default is UTF-8.

We are talking about the future, not the past. Design for UTF-8 by default, and BOMs for the other Unicode encodings. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!

When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.
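A sketch of that override chain in Python; MYAPP_ENCODING and the per-call argument are hypothetical names, while locale.getpreferredencoding is the real stdlib hook for the OS default:

```python
import locale
import os

# Innermost override first, universal default last, as described above.
def effective_encoding(cli_arg: str | None = None) -> str:
    return (cli_arg                                 # application argument
            or os.environ.get("MYAPP_ENCODING")     # hypothetical app-level variable
            or locale.getpreferredencoding(False)   # OS locale default
            or "utf-8")                             # the default-default: UTF-8
```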
kazinator about 1 year ago
Indeed, you can't assume UTF-8.

What you do, rather, is drop support for non-UTF-8.

Work with tech stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.

Let the customers who cling to data in weird encodings go to someone who makes supporting that their niche.
djha-skin about 1 year ago
Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode [1]:

> What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, *based on the frequency in which various bytes appear in typical text in typical encodings of various languages*, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It's truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, *until one day, they write something that doesn't exactly conform to the letter-frequency-distribution of their native language*, and Internet Explorer decides it's Korean and displays it thusly, proving, I think, the point that Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle.

1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
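The chardet library implements roughly this letter-frequency approach; a small sketch, assuming chardet is installed (pip install chardet):

```python
import chardet

# Cyrillic text in a legacy encoding: invalid as UTF-8, so heuristics kick in.
data = "Привет, мир".encode("koi8-r")

guess = chardet.detect(data)   # e.g. {'encoding': ..., 'confidence': ..., 'language': ...}
print(guess)

# The result is a statistical guess with a confidence score, not a fact;
# short or atypical inputs are exactly where it misfires, as Joel notes.
print(data.decode(guess["encoding"] or "utf-8", errors="replace"))
```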
bhaney about 1 year ago
I'm just gonna assume UTF-8
hnick about 1 year ago
Based on my past role, you can't even assume UTF-8 when the file says it's UTF-8.

Clients would constantly send CSV or other files with an explicit BOM or other marking indicating UTF-8, but the parser would choke since they just output native Windows-1252 or similar into it. I think some programs just spit it out since it's standard.
groestl about 1 year ago
I will assume it, I will enforce it where I can, and I will fight tooth and nail should push come to shove.

I got 99 problems, but charsets ain't one of them.
zadokshi about 1 year ago
Better to assume UTF-8 and fail with a clear message/warning. Sure, you can offer to guess to help the end user if it fails, but as other people have pointed out, it's been standard for a long time now. Even Python caved and accepted it as the default: https://peps.python.org/pep-0686/
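A minimal sketch of assume-UTF-8-and-fail-clearly; the error wording is just one possibility:

```python
# Decode strictly, and surface the exact offending byte instead of
# silently guessing at an encoding.
def read_utf8(path: str) -> str:
    raw = open(path, "rb").read()
    try:
        return raw.decode("utf-8")   # strict error handling is the default
    except UnicodeDecodeError as err:
        raise SystemExit(
            f"{path}: invalid UTF-8 at byte {err.start} "
            f"(0x{raw[err.start]:02x}); please convert the file to UTF-8"
        ) from err
```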
Veserv about 1 year ago
Off-topic, but the bit-numbering convention is deliciously confusing.

Little-endian bytes (lowest byte is leftmost) and big-endian bits (bits contributing less numerical value are rightmost) are normal, but the bits are referenced/numbered little-endian (the first bit is the leftmost even though it contributes the most numerical value). When I first read the numbering convention I thought it was going to be a breath of fresh air of someone using the much more sane, but non-standard, little-endian bits with little-endian bytes, but it was actually another layered twist. Hopefully someday English can write numbers little-endian, which is objectively superior, and do away with this whole mess.
o11c about 1 year ago
Default UTF-8 is better than the linked suggestion of using a heuristic, but failing catastrophically when old data is encountered is unacceptable. There *must* be a fallback.

(Note that the heuristic for "is this intended to be UTF-8" is pretty reliable, but most other encoding-detection heuristics are of very bad quality.)
lifthrasiir about 1 year ago
You can't just assume UTF-8, but you can *verify* that it is almost surely encoded in UTF-8, unlike other legacy encodings. Which makes UTF-8 the first and foremost consideration.
norir about 1 year ago
If it's turtles all the way down and at every level you use UTF-8, it's hard to see how any input with a different encoding (for the same underlying text) will not be detected before any unintended side effects are invoked.

At this point, I don't see any sufficiently good reason not to use UTF-8 exclusively in any new system. Conversions to and from other encodings would only be done at well-defined boundaries, when I'm calling into dependencies that require non-UTF-8 input for whatever reason.
bandyaboot about 1 year ago
> In the most popular character encoding, UTF-8, character number 65 ("A") is written:

> 01000001

> Only the second and final bits are 1, or "on".

Isn't it more accurate to say that the first and penultimate bits are 1, or "on"?
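The two numbering conventions in play, made concrete in Python:

```python
# Reading the printed string left to right, the second and final digits
# are "1"; numbering bits from the least significant (bit 0) upward, the
# set bits are bit 0 and bit 6, i.e. the "first and penultimate" bits.
bits = format(ord("A"), "08b")
print(bits)                                          # 01000001
print([i for i, b in enumerate(bits) if b == "1"])   # string positions: [1, 7]
print([i for i in range(8) if ord("A") >> i & 1])    # bit numbers: [0, 6]
```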
vitaut about 1 year ago
This is so spectacularly outdated. KOI-8 has been dead for ages.
vkaku about 1 year ago
The probability of web content not being in UTF-8 is getting lower and lower.

Last I tracked, as of this month, 0.3% of surveyed web pages used Shift JIS. It has been declining steadily. I really hope people move to UTF-8. While it is important to understand how the code pages and encodings helped, I think it's a good time to actually start moving a lot of applications to use UTF-8. I am perfectly okay if people want to use UTF-16 (the OG Unicode) and its extensions alternatively, especially for Asian applications.

Yes, historic data preservation requires a different strategy than designing stuff for the future. It is okay, however, to migrate to these encodings and keep giving old data and software new life.
mihaaly about 1 year ago
Excellent article, good content, good length, enlightened subtexts and references, joy to read.
lolc about 1 year ago
Just the most recent episode: a statistician is using PHP, on Windows, to analyze text for character frequency. He's rather confused by the UTF-16LE encoding and thinks the character "A" is numbered 4100, because that's what is shown in a hex editor. I tried explaining about the little-endian part, and the mb-string functions in PHP, and that PHP is not a good fit for his projects.

Then I realized that this is hilarious and I won't be able to kick him out of his local minimum there. Everything he could learn about encodings would first complicate his work.
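The confusion, reproduced in two lines of Python:

```python
# "A" (code point 0x41) serialized as UTF-16LE is the byte pair 41 00,
# so a hex dump reads "4100" even though the code point is just 0x41.
print("A".encode("utf-16-le").hex())   # -> 4100
print(hex(ord("A")))                   # -> 0x41
```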
flohofwoe about 1 year ago
The post seems to assume that only UTF-16 has byte order marks, but as pointless as it sounds, UTF-8 has a BOM too (EF BB BF). It seems to be a Windows thing though; I haven't seen it in the wild anywhere else (and only rarely on Windows, since text editors typically allow saving UTF-8 files with or without a BOM; I guess which of those is the default depends on the text editor).
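Python's utf-8-sig codec copes with exactly this Windows-ism; a sketch (the file name is a placeholder):

```python
# "utf-8-sig" strips the EF BB BF BOM if present and behaves like plain
# UTF-8 otherwise, so callers never see a stray U+FEFF.
with open("data.txt", encoding="utf-8-sig") as f:
    text = f.read()
assert not text.startswith("\ufeff")
```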
rob74 about 1 year ago
30 years ago: "you can't just assume ASCII".

Today: "you can't just assume UTF-8".

The more things change, the more they stay the same...
pronoiac about 1 year ago
Archive copy: https://web.archive.org/web/20240429061925/https://csvbase.com/blog/9
drdaeman about 1 year ago
Anyone got EBCDIC on their bingo cards? Because if the argument is "legacy encodings are still relevant in 2024", then we also need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more perverted fun) into the picture. Makes heuristics extra fun.

Or, you know, just say "nah, I can; that ancient stuff doesn't matter (outside of obligatory exceptions, like software archeology) anymore." If someone wants to feed me a KOI8-R or JIS X 0201 CSV heirloom, they should convert it into something modern first.
iamcreasy about 1 year ago
By heuristics, is the author referring to the rules and policies published by Unicode? [1]

Link [1] was referred to as a solution to this problem in the article The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 [2].

[1] https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

[2] https://tonsky.me/blog/unicode/
mgaunard about 1 year ago
There is a pretty successful world language standard: English.
jujube3 about 1 year ago
Actually, I can just assume UTF-8, since that's what the world standardized on. Just like I can assume the length of a meter or the weight of a gram. There is no need to have dozens of incompatible systems.
mschuster91 about 1 year ago
> CSV files, in particular, have no way to signal, in-band, which encoding is used.

That's actually wrong. Add a UTF-8 BOM; that's enough for Excel (and some other libraries) to know what is going on [1].

[1] https://csv.thephpleague.com/8.0/bom/
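A sketch of emitting that BOM from Python's csv module (the file name is a placeholder):

```python
import csv

# Writing with "utf-8-sig" prepends the EF BB BF BOM, the in-band hint
# Excel uses to pick UTF-8 when opening a CSV.
with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["name", "café"])
```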
planede about 1 year ago
A bag of bytes is a bag of bytes. Any encoding should be either assumed by the protocol being used or otherwise specified.
jcranmer about 1 year ago
I haven't seen discussion of this point yet, but the post completely fails to provide any data to back up its assertion that charset-detection heuristics work, because the feedback I've seen from people who actually work with charsets is that they largely *don't* (especially if you're based on naive one-byte frequency analysis). Okay, sure, it works if you want to distinguish between KOI8-R and Windows-1252, but what about Windows-1252 and Windows-1257?

See for example this effort in building a universal charset detector in Gecko: https://bugzilla.mozilla.org/show_bug.cgi?id=1551276
Karellen about 1 year ago
Don't worry, I never assume UTF-8.

I *require* UTF-8. If it isn't currently UTF-8, it's someone else's problem to transform it to UTF-8 first. If they haven't, and I get non-UTF-8 input, I'm fine bailing on that with a "malformed input - please correct" error.
chungy about 1 year ago
The article's pretty weird for presenting little-endian UTF-16 as normal and barely even mentioning that big-endian is an option (in fact, it seems to refer to it as "backwards"), even though big-endian is a much more human-readable format.
pylua about 1 year ago
Stupid question: how are the headers passed for HTTP? What encoding describes the encoding?
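For what it's worth, the HTTP header block itself is ASCII per the HTTP specifications, so there is no chicken-and-egg problem; a tiny sketch with a made-up header line:

```python
# The header bytes can be decoded as ASCII before the body's charset is
# known; the charset parameter then describes only the body.
raw_header = b"Content-Type: text/html; charset=windows-1251\r\n"
name, _, value = raw_header.decode("ascii").strip().partition(":")
print(name.strip(), "->", value.strip())
```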
hot_gril about 1 year ago
Java and JavaScript both use UTF-16 for internal string representation, even though JSON specifies UTF-8. Windows APIs too. I'm still not sure why, but it means that one char uses at least 2 bytes even if it's in the ASCII range.
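The size difference, measured from Python (which hides its internal string representation but can encode to both forms):

```python
# Even plain ASCII costs two bytes per character in UTF-16.
print(len("A".encode("utf-16-le")))    # 2
print(len("A".encode("utf-8")))        # 1
# Characters outside the BMP need a surrogate pair, i.e. four bytes.
print(len("😀".encode("utf-16-le")))   # 4
```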
dandigangi about 1 year ago
Except I can
GnarfGnarf about 1 year ago
Why would anyone use anything other than UTF-8 in this day and age?

Windows took a gamble years ago when the winner was not obvious, so we're stuck with UCS-2; but you can circumvent that with a good string library like Qt's QString.
otikik about 1 year ago
However, you can check for invalid UTF-8 sequences, throw an error with "invalid encoding on byte x, please use valid utf-8" if encountered, and from that point on assume UTF-8.
Dwedit about 1 year ago
But you can assume non-UTF-8 upon seeing an invalid UTF-8 byte sequence. From there, it can be application-specific, depending on what encoding you expect to see.

(And of course UTF-16 if there's a BOM.)
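A sketch of that detection order: honor a BOM, then try strict UTF-8, then fall back to an application-specific legacy default (cp1252 here is an assumption, not a recommendation):

```python
import codecs

def decode_best_effort(data: bytes) -> str:
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")   # the utf-16 codec consumes the BOM
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")   # application-specific legacy fallback
```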
camgunz about 1 year ago
Dear lazyweb: I think I read something about Postel's Law being essential to the internet's success -- maybe this was also IPv6-related? Does anyone else remember this?
teliskr about 1 year ago
I maintain a system I created in 2004 (crazy right). Not sure how we lived; but at the time, emojis were not as much of a thing. This has come to bite me several times.
ahi about 1 year ago
Why UTF-8 when we have MARC-8? https://en.wikipedia.org/wiki/MARC-8
tanin about 1 year ago
I'm actually having this issue where users import CSV files that don't seem to be valid. DuckDB would throw out errors like "Invalid Closing Quote: found non trimable byte after quote at line 34", "Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in value construction", and "Value with unterminated quote found".

One example: pgAdmin can export a database table into a CSV... but the CSV isn't valid for DuckDB to consume, because, for some odd reason, pgAdmin uses a single quote to escape a double quote.

This blog is pretty timely. Thank you for writing it!
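If the quirk really is single-quote escaping, DuckDB's CSV reader accepts an explicit escape character; a hedged sketch, untested against real pgAdmin output (the file name is a placeholder):

```python
import duckdb

# Declare the file's quoting rule instead of letting the sniffer guess:
# quote is the double quote, but the escape character is a single quote.
rows = duckdb.sql(
    "SELECT * FROM read_csv('export.csv', quote='\"', escape='''')"
).fetchall()
```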
AzzyHN about 1 year ago
I'm not a programmer, but how hard is it to write something that just checks what encoding a file uses, or at least does its best to guess?
Havoc about 1 year ago
You underestimate my willingness to happy path code...
missblit about 1 year ago
Encodings I have used or been exposed to in my career: ASCII, Latin1, Windows-1252, UTF-8, UTF-16, UTF-32, GBK, Zawgyi, Shift-JIS
kdklol about 1 year ago
> You can't just assume UTF-8

But I will, because in this day and age I should be perfectly able to do so. Non-use of UTF-8 should simply be considered a bug, and not treating text as UTF-8 should frankly be a you problem. At least for anything reasonably modern, by which I mean made in the last 15 years at least.
dublin about 1 year ago
Make your life easy. Assume 7-bit ASCII. No one needs all those other characters, anyway...
rch about 1 year ago
Don&#x27;t assume; force UTF-8.
teknopaul about 1 year ago
I suspect the author doesn't know about the first-bit-being-1 thing.

UTF-8 is magic.

You can assume US-ASCII for lots of very useful text protocols, like HTTP and STOMP, and not care what the variable string bytes mean.

Soooo many software architects don't grok the magic of it.

You can define an 8-bit parser that checks for "a:"(some string)\n and work with a shitload of human languages.

The company I work for does not realise that most of the 50-year-old legacy C it has is fine with UTF-8 for all the arbitrary fixed-length or \0-terminated strings it stores.
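The property being relied on, made concrete: in UTF-8, every byte of a multi-byte character has its high bit set, so a byte-oriented scan for ASCII delimiters can never match mid-character. A small Python sketch:

```python
# Parse a "key: value\n" frame at the byte level without decoding first;
# the ':' byte (0x3A) cannot occur inside any multi-byte UTF-8 sequence.
frame = "naïve-héader: sömé UTF-8 välue\n".encode("utf-8")
key, _, rest = frame.partition(b":")
print(key.decode("utf-8"), "->", rest.strip().decode("utf-8"))
```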
neonsunset about 1 year ago
I *will* assume UTF-8, and if it's not, it will be your fault :)
smeagull about 1 year ago
I absolutely can. If it's not UTF-8, I assume it's worthless.
hchak about 1 year ago
Hoping that LLMs can solve our “Tower of Babel” problem… :)
stalfosknight about 1 year ago
You can on all of Apple's platforms.
teddyh about 1 year ago
If you're actually in a position where you *need* to *guess* the encoding, something like ftfy (https://github.com/rspeer/python-ftfy; webapp: https://ftfy.vercel.app/) is a perfectly reasonable choice.

But you should always do your absolute utmost *not* to be put in a situation where guessing is your only choice.
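A sketch of ftfy's intended use: it repairs mojibake (text that has already been decoded with the wrong codec) rather than sniffing raw bytes; assumes ftfy is installed (pip install ftfy):

```python
import ftfy

# "✔" (U+2714) encoded as UTF-8 and mis-decoded as cp1252 becomes "âœ”";
# ftfy recognizes and reverses that round trip.
print(ftfy.fix_text("âœ” No problems"))   # -> "✔ No problems"
```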
eqvinox about 1 year ago
The solution, obviously, is to train an LLM to recognize the character set.
koito17 about 1 year ago
The comments in this thread are a bit amusing.

I wish I could live in a world where I could bluntly say "I will assume UTF-8 and ignore the rest of the world". Many Japanese documents and sites still use Shift JIS. Windows has this strange Windows-932 format that you will frequently encounter in CUE files output by some CD-ripping software. ARIB STD-B24, the captioning standard used in Japanese television, has its own text encoding with characters not found in either JIS X 0201 or JIS X 0208. These special characters are mostly icons used in traffic and weather reports, but transcoding to UTF-8 still causes trouble with these icons.
jheriko about 1 year ago
No mention of the BOM... tragic.
mseepgood about 1 year ago
Yes, I can.
timoteostewart about 1 year ago
Fascinating topic. There are two ways the user/client/browser receives reports about the character encoding of content, and there are hefty caveats about how reliable those reports are.

(1) First, the web server usually reports a character encoding, a.k.a. charset, in the HTTP headers that come with the content. Of course, the HTTP headers are not part of the HTML document but are rather part of the overhead of what the web server sends to the user/client/browser. (The HTTP headers and the `head` element of an HTML document are entirely different.) One of these HTTP headers is called Content-Type, and conventionally this header often reports a character encoding, e.g., "Content-Type: text/html; charset=UTF-8". So this is one place a character encoding is reported.

If the actual content is *not* an (X)HTML file, the HTTP header might be the only report the user/client/browser receives about the character encoding. Consider accessing a plain-text file via HTTP. The text file isn't likely to itself contain information about what character encoding it uses. The HTTP header of "Content-Type: text/plain; charset=UTF-8" might be the only character-encoding information that is reported.

(2) Now, if the content is an (X)HTML page, a charset encoding is often also reported in the content itself, generally in the HTML document's head section in a meta tag such as '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>' or '<meta charset="utf-8">'. Now, just because an HTML document self-reports that it uses a UTF-8 (or whatever) character encoding, that's hardly a guarantee that the document does in fact use said character encoding.

Consider the case of a program that generates web pages using a boilerplate template still using an ancient default of ISO-8859-1 in the meta charset tag of its head element, even though the body content that goes into the template is being pulled from a database that spits out a default of UTF-8. Boom. Mismatch. Janky code is spitting out mismatched and inaccurate character-encoding information every day.

Or consider web servers. Consider a web server whose config file contains the typo "uft-8" because somebody fat-fingered while updating the config (I've seen this in random web pages). Or consider a web server that uses a global default of "utf-8" in its outgoing HTTP headers even when the content being served is a hodge-podge of UTF-8, WINDOWS-1251, WINDOWS-1252, and ISO-8859-1. This too happens all the time.

I think the most important takeaway is that with both HTTP headers and meta tags, there's no intrinsic link between the character encoding being reported and the *actual* character encoding of the content. What a web server tells me and what's in the meta tag in the markup just count as two reports. They might be accurate, they might not be. If it really matters to me what the character encoding is, there's nothing for it but to determine the character encoding myself.

I have a Hacker News reader, https://www.thnr.net, and my program downloads the URL for every HN story with an outgoing link. I have seen binary files sent with a "UTF-8" Content-Type header. I have seen UTF-8 files sent with an "inode/x-empty" Content-Type header. My logs have literally hundreds of goofy, inaccurate reports of content types and character encodings. Because I'm fastidious and I want to know what a file actually is, I have a function `get_textual_mimetype` that analyzes the content of what the URL's web server sends me. My program downloads the content and uses tools such as `iconv` and `isutf8` to get some information about what encoding it might be. It uses `xmlwf` to check whether it's well-formed XML. It uses `jq` to check whether it's valid JSON. It uses `libmagic`. There's a lot of fun stuff the program does to pin down with a high degree of certainty what the content is. I want my program to know whether the content is an application/pdf, an image/webp, a text/html, an application/xhtml+xml, a text/x-csrc, or whatever. Only a rigorous analysis will tell you the truth. (If anyone is curious, the source for `get_textual_mimetype` is in the repo for my HN reader project: https://github.com/timoteostewart/timbos-hn-reader/blob/main/utils_mimetypes_magic.py)
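A stdlib-only sketch of treating the declared charset as a claim to verify rather than a fact (the URL is a placeholder):

```python
import urllib.request

with urllib.request.urlopen("https://example.com/") as resp:
    declared = resp.headers.get_content_charset() or "utf-8"  # from Content-Type
    body = resp.read()

try:
    body.decode(declared)   # strict decode: does the body match the claim?
    print(f"body decodes cleanly as the declared {declared}")
except (UnicodeDecodeError, LookupError):
    print(f"declared charset {declared!r} is wrong or unknown; inspect the bytes")
```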
matheusmoreira about 1 year ago
Yeah, but I'm gonna do it anyway. If it's not UTF-8, it's terrible and broken and not worth supporting unless some serious cash is on the table.
klysm about 1 year ago
Meh, I'd rather start making those assumptions and blame everything that doesn't use UTF-8 as broken.
AlienRobot about 1 year ago
"You can't assume a 32-bit integer starts from 0."
jheriko about 1 year ago
Fuck me, it's 2024.

2014 me is shaking his head in ways 2004 me saw coming in 1994.

I'm only 40.

FUCK
TypicalHog about 1 year ago
IMO humanity should just do ASCII and call it a day. I'm not talking about the viability of this, just my wishes.