Show HN: Comma Separated Values (CSV) to Unicode Separated Values (USV)

208 points by jph, about 1 year ago

42 comments

jonathaneunice, about 1 year ago
Fascinated that this uses the Unicode glyphs / symbols for unit and record separator rather than the unit and record separators themselves (ASCII US and RS).

Perfect deployment of David Wheeler's aphorism:

> All problems in computer science can be solved by adding another level of indirection.

https://en.wikipedia.org/wiki/David_Wheeler_(computer_scientist)

tambourine_man, about 1 year ago
ASCII has a field delimiter character. The fact that we chose commas and tabs because a field delimiter character is hard to type or see is one of those things that saddens me in computing.

Imagine the amount of pain that could have been spared if we had done it right from the start some 50 years ago.

vidarh, about 1 year ago
Their examples, if anything, convinced me not to use this for a long time.

I need to zoom to be able to tell these apart, so I'll need editor support for it to be convenient to work with these anyway. And then clicking through to the comparisons, it demonstrates the difference *existing support for CSV "everywhere"* makes - GitHub renders the CSV examples nicely as tables, while again I need to zoom in to see which separator is which for USV.

Maybe once there is widespread editor support. But if you need editor support for it to be comfortable anyway, then the main benefit vs. using the old-school actual separator characters goes out the window.

jiehong, about 1 year ago
For those wondering what USV is, like myself:

> Unicode separated values (USV) is a data format that uses Unicode symbol characters between data parts. USV competes with comma separated values (CSV), tab separated values (TSV), ASCII separated values (ASV), and similar systems. USV offers more capabilities and standards-track syntax.
>
> Separators:
>
> ␟ U+241F Symbol for Unit Separator (US)
> ␞ U+241E Symbol for Record Separator (RS)
> ␝ U+241D Symbol for Group Separator (GS)
> ␜ U+241C Symbol for File Separator (FS)
>
> Modifiers:
>
> ␛ U+241B Symbol for Escape (ESC)
> ␗ U+2417 Symbol for End of Transmission Block (ETB)
> ␖ U+2416 Symbol for Synchronous Idle (SYN)
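
To make the separator list above concrete, here is a minimal Python sketch that splits USV text into records and units using only ␞ (U+241E) and ␟ (U+241F). The name `parse_usv` is made up for illustration. The sketch assumes every record ends with ␞ and tolerates a trailing ␟ after the last unit; it ignores groups, files, and the ESC/ETB/SYN modifiers, so it should not be read as a conforming parser.

```python
UNIT_SEP = "\u241F"    # ␟ Symbol for Unit Separator
RECORD_SEP = "\u241E"  # ␞ Symbol for Record Separator


def parse_usv(text: str) -> list[list[str]]:
    """Split USV text into records, each a list of units (hypothetical helper)."""
    records = []
    for chunk in text.split(RECORD_SEP):
        if not chunk:
            # Skip the empty piece after the final record separator.
            # (Assumption: genuinely empty records are not needed here.)
            continue
        units = chunk.split(UNIT_SEP)
        if units and units[-1] == "":
            units.pop()  # drop the empty piece after a trailing unit separator
        records.append(units)
    return records


print(parse_usv("a␟b␟c␟␞d␟e␟f␟␞"))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

Because the separators are ordinary characters, plain string splitting is enough for this simple case; escaping is what a real implementation would have to add.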

pquki4, about 1 year ago
The usv github repository says it is "the standard for data markup of ...", has 66 stars, and is *currently* applying for "text/usv" MIME type. That's about all there is to it.

Maybe I'll consider it when it does not belong to a company, has two more zeros in the number of stars, and has RFC/ISO attached to it. Because right now it is not much more of a "standard" than a hobby project I create on a whim.

eli, about 1 year ago
Not sure I understand the advantage over ASCII Separated Values (ASV), which uses ASCII control characters 0x1E and 0x1F.
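
For readers comparing the two, the mapping is mechanical: each Unicode control picture sits at U+2400 plus the code of the ASCII control character it depicts. A small Python sketch of converting in both directions follows; the function names are illustrative only, and USV's escape handling is deliberately left out.

```python
# Control pictures live at U+2400 + the ASCII control code they depict.
CONTROL_PICTURE_BASE = 0x2400
SEPARATOR_CODES = (0x1C, 0x1D, 0x1E, 0x1F)  # FS, GS, RS, US

# str.translate accepts a dict mapping code points to code points.
ASV_TO_USV = {code: CONTROL_PICTURE_BASE + code for code in SEPARATOR_CODES}
USV_TO_ASV = {picture: code for code, picture in ASV_TO_USV.items()}


def asv_to_usv(text: str) -> str:
    """Replace invisible ASCII separators with their visible control pictures."""
    return text.translate(ASV_TO_USV)


def usv_to_asv(text: str) -> str:
    """Replace control pictures with the ASCII separators they depict."""
    return text.translate(USV_TO_ASV)


asv = "a\x1fb\x1e"
usv = asv_to_usv(asv)
print(usv)  # a␟b␞
```

The visible form survives copy and paste and shows up in any editor with a suitable font; the ASCII form costs one byte per separator instead of three in UTF-8.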

philsnow, about 1 year ago
> The Synchronous Idle (SYN) symbol is a heartbeat, and is especially useful for streaming data, such as to keep a connection alive.
>
> SYN tells the data reader that data streaming is still in progress.
>
> SYN has no effect on the output content.
>
> Example of a unit that contains a Synchronous Idle:
>
> a␖b␞

Why would this go in-band inside a document format? Just why? If you want keep-alives, use a kind of connection that supports out-of-band keepalives.

If you download the same document twice, and the second time the server is heavily loaded (or it's waiting on some dependency, or whatever), presumably the server will helpfully generate some SYNs in the middle of the document to keep the connection alive (?), but now you've got the same document "spelled" two different ways, which won't checksum alike.

SYN along with the weirdness of

> Escape + [non-USV-special] character: the character is ignored

means that you have arbitrarily many ways of writing semantically-same documents.
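
The checksum concern can be shown in a few lines of Python. This sketch takes the quoted rule ("SYN has no effect on the output content") at face value and uses a made-up helper, `strip_syn`, as a stand-in for normalization; it does not model the escape rule, so an escaped SYN would be handled incorrectly.

```python
import hashlib

SYN = "\u2416"  # ␖ Symbol for Synchronous Idle


def strip_syn(text: str) -> str:
    """Remove SYN symbols (hypothetical normalization step, escapes not handled)."""
    return text.replace(SYN, "")


def digest(text: str) -> str:
    """SHA-256 of the UTF-8 bytes, as a downloader might compute it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


plain = "ab\u241E"                # ab␞
with_heartbeat = "a\u2416b\u241E"  # a␖b␞, the example from the quoted text

print(digest(plain) == digest(with_heartbeat))                        # False
print(digest(strip_syn(plain)) == digest(strip_syn(with_heartbeat)))  # True
```

Any consumer that wants stable checksums would have to agree on such a canonical form first, which is the extra burden the comment is pointing at.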

SuperHeavy256, about 1 year ago
I've long wanted a successor to CSV, but this is kinda stupid. People like CSVs because they look good and feel natural even in plaintext. This is the same reason that Markdown is successful.

As for including commas in your data, it could just have been managed with a simple escape character like a \, for when there's actually a comma in your data. That's it.
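
Backslash escaping of this kind is already available in some CSV tooling. As one illustration (not something the comment itself refers to), Python's standard `csv` module switches from quoting to escaping when given `quoting=csv.QUOTE_NONE` and an `escapechar`:

```python
import csv
import io

# QUOTE_NONE plus escapechar makes the csv module escape delimiters
# with a backslash instead of wrapping fields in double quotes.
options = {"escapechar": "\\", "quoting": csv.QUOTE_NONE}

buffer = io.StringIO(newline="")
csv.writer(buffer, **options).writerow(["Doe, Jane", "42"])
line = buffer.getvalue()
print(repr(line))  # 'Doe\\, Jane,42\r\n'

# Reading with the same options turns the escaped comma back into data.
rows = list(csv.reader(io.StringIO(line, newline=""), **options))
print(rows)  # [['Doe, Jane', '42']]
```

The catch is that writer and reader must agree on the escape convention, which is part of why so many CSV dialects exist in practice.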

forgetfulness, about 1 year ago
Seems complex enough that you'd only manipulate files in this format by serializing through a tool, and by then it's competing with established binary formats rather than CSV.

bombledmonk, about 1 year ago
I've actually been employing Emoji Separated Values (ESV), often , here and there when doing some of this kind of work. Granted, it's not standard, but it's been really useful when I've needed it.

*edit: Apparently emojis don't fly here, but it was an index finger pointing right.

crq-yml, about 1 year ago
It's sensible in principle:

* Editors will play nicely with the graphical representation. If you need better graphics, it's done with font customization, which everyone already supports.

* It announces that the data is source text, vs transmitted bytes. The type/token distinction is not easy to overcome.

* It sits way out in Unicode's space where a collision is unlikely. The whole reason why CSV-type formats create frustration is because the tooling is ad-hoc, never does the right thing and uses the lower byte spaces where people stuff all kinds of random junk. This is the "fuck it, you get the same treatment as a Youtube video id" kind of solution.

That said, if used, someone will attack it by printing those characters as input.

yewenjie, about 1 year ago
I'm still confused whether this is a joke or not.

code-faster, about 1 year ago
CSV is great because Excel can import it, but it can't import USV, so at that point, why use USV when you can use JSON?

https://github.com/tyleradams/json-toolkit/

nilslice, about 1 year ago
If you would like to run csv-to-usv from 15+ languages (not only Rust!) then check out this demo I made, converting the library to an Extism plugin function: https://github.com/extism/extism-csv-to-usv

Here's a snippet that runs it in your browser:

```javascript
// Simple example to run this in your browser! But will work in Go, PHP, Ruby, Java, Python, etc...
const extism = await import("https://esm.sh/@extism/extism");
const plugin = await extism.createPlugin(
  "https://cdn.modsurfer.dylibso.com/api/v1/module/a28e7322a6fde92cc27344584b5e86c211dbd5a345fe6ec95f1389733c325541.wasm",
  { useWasi: false }
);
let out = await plugin.call("csv_to_usv", "a,b,c");
console.log(out.text());
```

pimlottc, about 1 year ago
> Is USV aiming to become a standard?
>
> Yes and we've submitted the first draft of the USV standard to the IETF: link.

This is a nice idea and all, but it seems unlikely to become a meaningful standard without some major backing behind that "we".

jefftk, about 1 year ago
Description of USV: https://github.com/sixarm/usv

otabdeveloper4, about 1 year ago
Absolutely terrible documentation. The RFC doesn't even explain the purpose of the "End of Transmission Block" token.

michaelmior, about 1 year ago
If I understand the API correctly from my brief glance, the crate returns a triply-nested vector with the outermost vector being the equivalent of CSV rows, then CSV columns, then "units" which don't have a direct CSV equivalent. It would be helpful if there was an API method that returned results without this final level of nesting, perhaps panicking if there is more than one unit. This would make it easier to deal with the common case (in CSV at least) where each column only has a single value.
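
The requested convenience method can be pictured with a short Python sketch over nested lists. This is an illustration of the shape only: the crate itself is Rust, `single_unit_rows` is a hypothetical name that does not exist in it, and raising `ValueError` stands in for panicking. The sketch is slightly stricter than the suggestion, rejecting columns with zero units as well as more than one.

```python
def single_unit_rows(nested):
    """Collapse rows -> columns -> units into rows -> values.

    Raises ValueError if any column does not hold exactly one unit.
    """
    flat = []
    for row_index, row in enumerate(nested):
        flat_row = []
        for col_index, units in enumerate(row):
            if len(units) != 1:
                raise ValueError(
                    f"row {row_index}, column {col_index}: "
                    f"expected 1 unit, got {len(units)}"
                )
            flat_row.append(units[0])
        flat.append(flat_row)
    return flat


nested = [[["a"], ["b"]], [["c"], ["d"]]]
print(single_unit_rows(nested))  # [['a', 'b'], ['c', 'd']]
```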

Fileformat, about 1 year ago
A similar concept that is (IMHO) much nicer: RSV

It doesn't need any escaping or quoting: a field just has to be valid UTF-8.

The trick is that the delimiters are bytes that are invalid UTF-8.

The spec fits on a napkin, parsing is trivial, you can jump to the middle of a doc and find the nearest row, etc.

Main downside is you need an editor/viewer that can handle it.

https://github.com/Stenway/RSV-Specification
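
A rough Python sketch of the idea follows. The specific byte values (0xFF after each value, 0xFD after each row) reflect my reading of the RSV specification and should be treated as an assumption; the point that matters is that bytes in this range never occur in valid UTF-8, so no escaping is needed. Null values, which the format also supports, are left out.

```python
VALUE_END = b"\xff"  # assumed value terminator byte (never valid in UTF-8)
ROW_END = b"\xfd"    # assumed row terminator byte (never valid in UTF-8)


def encode_rsv(rows):
    """Encode rows of strings: each value and each row gets a terminator byte."""
    out = bytearray()
    for row in rows:
        for value in row:
            out += value.encode("utf-8") + VALUE_END
        out += ROW_END
    return bytes(out)


def decode_rsv(data):
    """Decode bytes produced by encode_rsv back into rows of strings."""
    rows = []
    # The piece after the final terminator is always empty, so drop it.
    for raw_row in data.split(ROW_END)[:-1]:
        rows.append([v.decode("utf-8") for v in raw_row.split(VALUE_END)[:-1]])
    return rows


rows = [["a,b", "line\nbreak"], ["", "ü"]]
assert decode_rsv(encode_rsv(rows)) == rows
```

Commas, quotes, and line breaks pass through untouched because none of them can collide with the terminator bytes.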

codeulike, about 1 year ago
CSV is like an invasive plant species, or perhaps a curse; you're never going to be able to root it out even though there are a billion better data formats.

zzo38computer, about 1 year ago
I have seen Unicode Separated Values. I don't like Unicode and I like USV even less. I like ASCII Separated Values, which can encode each separator as a single byte, and can be used with character encodings other than Unicode (and, even if you do use it with Unicode, does not prevent you from using the Unicode control pictures in your data; USV does prevent you from using those characters in your data even though the data is (allegedly) Unicode).

What they say about display and input really depends on the specific editors and viewers that you are using (and perhaps on the fonts as well). When I use vi, I have no difficulty entering ASCII control characters in the text. However, there is also the problem with line breaking, with ASV and with USV, anyways; and they do mention this in the issues anyways.

Fortunately, I can write a program to convert these formats without too much difficulty, even without implementing Unicode (since it is a fixed sequence of bytes that will need to be replaced; however, it does mean that it will need to read multiple bytes to figure out whether or not it is a record separator, which is not as simple as ASV).
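
The byte-level conversion described in the last paragraph might look like the following Python sketch, which never decodes the text and only swaps fixed byte sequences. The function name is illustrative, only the unit and record separators are covered, and escaped separators are not handled.

```python
# UTF-8 encodings of the two most common USV separators.
USV_UNIT = b"\xe2\x90\x9f"    # ␟ U+241F
USV_RECORD = b"\xe2\x90\x9e"  # ␞ U+241E


def usv_bytes_to_asv(data: bytes) -> bytes:
    """Swap the three-byte USV separators for the one-byte ASCII US and RS."""
    return data.replace(USV_UNIT, b"\x1f").replace(USV_RECORD, b"\x1e")


sample = "a␟b␞".encode("utf-8")
print(usv_bytes_to_asv(sample))  # b'a\x1fb\x1e'
```

This works on raw bytes because UTF-8 is self-synchronizing: in valid input, the three-byte pattern cannot appear straddling two other characters.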

jbaber, about 1 year ago
I've been using an emoji separated values format for a personal project where fields contain lots of special characters including whitespace.

I'd previously given up using ASV because of the printability and copy/paste problems described. Replacing the control characters by their printable glyphs solves all my previous problems and is as genius as it is naughty.

I sympathize with the arguments people here present against and agree the SYN character and Group Separator are weird -- but cause no harm. I'm not bothered by the same data having multiple representations since I'm insisting on human readability rather than byte-by-byte perfection in the first place.

It took 20 minutes to convert my project and I'm very happy.

Only tooling change I had to make was adding digraphs to vim

digraph rs 9246 us 9247

etc. Easy to type directly in my .usv file. Easy to type and read in some Python consuming it.

Regardless of it becoming a standard and my lingering grouchiness about multi-byte characters, needing to use non-xterm, etc. this works very well for me.

evrimoztamur, about 1 year ago
First time hearing about USV, nifty! However, I think the adoption challenge here remains Excel support (very tough).

ochrist, about 1 year ago
If you live in a place where comma is the decimal separator, your CSV files will often use semicolon as the separator instead of comma. Will this tool cater for that?
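
Whether csv-to-usv handles this is not answered here. For comparison only, this is how a semicolon-delimited file with decimal commas is read using Python's standard `csv` module, where the delimiter is an explicit parameter:

```python
import csv
import io

# Semicolon-separated data with decimal commas, as produced in many locales.
data = "name;price\nwidget;1,50\ngadget;12,00\n"

rows = list(csv.reader(io.StringIO(data, newline=""), delimiter=";"))
print(rows)  # [['name', 'price'], ['widget', '1,50'], ['gadget', '12,00']]
```

A converter would need a similar option, or some form of delimiter detection, before it could turn such files into USV correctly.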

evnix, about 1 year ago
There are some well researched alternatives to CSV.

Off the top of my head, I can highly recommend SML:

https://dev.stenway.com/SML/SimpleML.html

I recommend watching the "stop using CSV" video too:

https://youtu.be/mGUlW6YgHjE?si=zDG_9Jv8LSy-ttP4

1vuio0pswjnm7, about 1 year ago
Text-only, no Javascript:

https://static.crates.io/readmes/csv-to-usv/csv-to-usv-1.1.2.html

https://github.com/sixarm/csv-to-usv-rust-crate/

nayuki, about 1 year ago
Nope, this isn't a good approach. I prefer tab-separated values (TSV) and use it as much as possible.

sukmaagung, about 1 year ago
Instead of using a separator character:

a <comma> b <comma> c <enter> d <comma> e <comma> f

why not use a header character:

<row><cell> a <cell> b <cell> c <row><cell> d <cell> e <cell> f

tamimio, about 1 year ago
I am uncertain, but this is likely to reintroduce the issue of Unicode buffer overflow into the mainstream. What are your proposed solutions, considering it is expected to become standardized?

isoprophlex, about 1 year ago
This is just ESV files with extra complexity!

ESV: eggplant-separated values. Because who is ever going to put AUBERGINE (U+1F346) into a dataset? It's the perfect record separator!

hermitcrab, about 1 year ago
Alternatives to CSV are also covered at length at:

https://news.ycombinator.com/item?id=31220841

difer7, about 1 year ago
Does USV support nested fields? While reading the USV GitHub README I did not clearly understand the purpose of the "group separator".

teddyh, about 1 year ago
This is needlessly adding yet another standard¹ to the mix. If you are in a position to choose what standard you use, just use:

• Whatever is best for the data model and/or languages you use. JSON is a common modern choice, suitable for most things.

• If you want something more tabular, closer to CSV (which is a valid choice for bulk data), use strict RFC 4180 compliant data.

• If you want to specify your own binary super-compact data, use ASN.1. I am also given to understand that Protobuf is a popular modern choice.

If you *aren’t* in a position to choose your standards, just do whatever you need to do to parse whatever junk you are given, and emit as standards-compliant data as possible as output; again, RFC 4180 is a great way to standardize your own CSV output, as long as you stick to a subset which the receiving party can parse.

Nobody needs “USV”, and nobody should use it.

1. <https://xkcd.com/927/>
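
As one way to follow the RFC 4180 advice, here is a small Python sketch using the standard `csv` module. To my understanding its default "excel" dialect lines up with the RFC's main rules (CRLF line endings, double quotes around fields that contain commas, quotes, or line breaks, and doubled embedded quotes), but treat that as an assumption and check the output against whatever the receiving party expects.

```python
import csv
import io

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)  # default "excel" dialect
writer.writerow(["id", "note"])
writer.writerow(["1", 'said "hi", then left'])

output = buffer.getvalue()
print(repr(output))  # 'id,note\r\n1,"said ""hi"", then left"\r\n'
```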

ggm, about 1 year ago
Conversion of file encoding from simple ASCII to UTF-8 has consequences beyond the field/record problem.

Some tools will randomly convert " to 'LEFT DOUBLE QUOTATION MARK' and 'RIGHT DOUBLE QUOTATION MARK' if they see UTF-8 flagging. Thus, the file is converted without your voluntary participation.

rgmerk, about 1 year ago
I think I'd prefer to wear out my keyboard typing XML tags than deal with this.

_obviously, about 1 year ago
Unicode is Turing complete which makes it an attack vector.

hughw, about 1 year ago
USV is doomed because Worse is Better[1] (edit: fix url)

[1] https://dreamsongs.com/RiseOfWorseIsBetter.html

samatman, about 1 year ago
Y'know, I greatly dislike this. It's an actual emotional reaction. This should not be standardized. No one should use this. This is a bad idea and deserves to die in obscurity.

I'll tell you why, it's pretty simple. The characters this... thing is stealing exist to represent invisible control sequences. That is their *use*. The fact that they can be *mentioned* by direct input is inevitable, but not to be encouraged.

I will be greatly disappointed if this is accepted as a standard. The fact that a USV file looks like a rendered ASV file is a show stopping bug, an anti-feature, an insult to life itself. Kill it with fire.

two_handfuls, about 1 year ago
Nice! Is there a Python library?

remram, about 1 year ago
Why not use parquet at this point? (or a row-oriented equivalent like Avro or SQLite)

If you don't have a human-readable file, it might as well be compressible, queryable, and metadata-enabled, I think.

justtinker, about 1 year ago
This is the XKCD comic in action. https://xkcd.com/927/

Someone should write a family of filters of the form CSV2ASV, CSV2USV, CSV2JSON, USV2XML, TOML2USV, USV2Cuneiform...

MrOxiMoron, about 1 year ago
/me looks at calendar, nope not April 1st yet.