ASCII Delimited Text – Not CSV or Tab Delimited Text

114 points by ejstronge 6 months ago

26 comments

Ukv 6 months ago

> with no restrictions on the text in fields or the need to try and escape characters.

Maybe I'm missing something, but wouldn't it still need escaping for those ASCII separator characters (or alternatively, a restriction for the stored text not to have them)?

It's true that having to deal with escaping much less often (since the ASCII separator characters are rarer than commas/quotes) would be convenient for manual reading/writing, but I feel that's canceled out by the characters being hard to type/see (likely the reason why they're rare) - and it wouldn't necessarily save on writer/parser code complexity.
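
To make the point in the comment above concrete, here is a minimal Python sketch of an ASCII-delimited writer and reader. The function names `encode` and `decode` are made up for illustration, and the choice to reject fields containing a separator (rather than escape them) is an assumption; the format as described in the article defines no escaping rule.

```python
US = "\x1f"  # ASCII 31, unit separator (between fields)
RS = "\x1e"  # ASCII 30, record separator (between records)


def encode(rows):
    """Join rows of string fields with US and RS.

    Hypothetical helper: fields containing a separator are rejected,
    because the format has no escape mechanism to fall back on.
    """
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def decode(text):
    """Split text back into rows; an empty string decodes to no records."""
    if text == "":
        return []
    return [record.split(US) for record in text.split(RS)]


rows = [["a,b", 'say "hi"', "line\nbreak"], ["x", "y", "z"]]
assert decode(encode(rows)) == rows  # commas, quotes, newlines need no escaping
```

Commas, quotes, and newlines pass through untouched, but the restriction has only moved: a field that itself contains ASCII 30 or 31 must be refused or escaped somehow.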

jmclnx 6 months ago

I like this format best of all; CSV is my #2 favorite.

At work we settled on using ^G (0x07) as a delimiter instead of TABs for file transfers and loading data into various databases.

The reason was Excel. People/systems who create these files sometimes source from Excel. And Excel can have a habit of placing odd characters in text fields. We found the one character never encountered was BEL.

For text fields we tend to remove embedded white space, after replacing TABs with 1 space.

aristus 6 months ago

In the early 2000s, back at the beginning of the world, Yahoo's web code used ^A and ^B for field and record separators to avoid having to escape commas and quotes and newlines. That was probably the last time I ever saw ASCII control characters used as intended in the wild.

There is no technical reason why CSV should have won out, except that keyboards have a comma key and almost never a ^A key.

tangus 6 months ago

And how do we escape those characters? With ESC (27)? Inside a SI/SO (15/14) pair?

I think CSV or TSV are good enough. People keep trying to find a format where you can separate the records and fields with a simple string.split and there's no need to contemplate escapes.

But that's not possible: no matter the format, you'll have to parse it right. And then, a format that uses visual delimiters has the obvious advantage of being editable with any text editor.

NBJack 6 months ago

Kind of a short-sighted take. Sticking special characters that (in many early editors) would be invisible complicates development and maintenance. Even tabs have a visual, albeit inconsistent (if your editor wants to align columns for you), manifestation you can work with.

Technically, XML is superior for data representation on many fronts. But likewise, it is an absolute PITA to maintain without significant editor support.

It is no accident that CSV/tabs 'won'.

dahart 6 months ago

> The most anoying thing about the whole problem is that it was solved by design in the ASCII character set.

This is a great example of not understanding what “the problem” actually is, and then assuming that because part of a technical solution exists, that everyone should be using it and if they’re not it’s because of ignorance rather than choice. I think we all do this, at least I know I’m sometimes guilty, but it’s amusing when faced with what happens in the real world at scale, to jump to the conclusion that the world is wrong rather than to first question our own assumptions.

Personally, I think it’s funny to assume that ASCII == text. Obviously not all ASCII is “text” in the sense that most people will assume. When people say “text file” I assume it contains nothing that you can’t type on a physical typewriter, other than the annoying and persistent difference between LF and CRLF. ASCII has lots of characters you can’t type on a typewriter, and are not intended to print as a character.

But if you want to invent new “text” characters for a “text” file, the problem suddenly becomes not just having a char code, but how to easily type it, how to easily display it, how to teach everyone to recognize and use it, and how to standardize these things so everyone knows them. Personally at this point I probably wouldn’t call a file with ASCII chars 28..31 in them “text”. The ASCII characters haven’t solved the overall problem, they have created several more and bigger problems that remain unsolved, and are much easier to solve in practice by using a comma instead, which is why people aren’t using the special ASCII characters in practice.

spiffytech 6 months ago

Some notes from when the USV project tried using control characters:

> We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.

> First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.

> Second, we encountered problems with copy/paste functionality, where it often didn't work because the editor implementations and terminal implementations copied visible letter-width characters, not the underlying invisible zero-width characters.

> Third, users were unable to distinguish between the rendered control picture characters (e.g. the editor saw ASCII 31 and rendered Unicode Unit Separator) versus the control picture characters being in the data content (e.g. someone actually typed Unicode Unit Separator into the data content).

https://github.com/SixArm/usv/tree/main/doc/faq#why-use-control-picture-characters-rather-than-the-control-characters-themselves

foxglacier 6 months ago

If people used these for ASCII delimited text, they'd have to not use them for anything else, like some other text format; otherwise you might insert an entire ASCII delimited file into a text field of that other thing and break that other thing's parsing. You couldn't even insert part of a file into a string field in another ASCII-delimited file. You only get to use them once, so they wouldn't be part of general purpose plain text, and an ASCII delimited file wouldn't be a plain text file that you could treat in the same way as other text files. So it's effectively a binary format, or it has restrictions on what text characters can appear in its records without escaping - oh no, that was its entire value proposition!

haddr 6 months ago

The reason CSV is still going strong is that it already covers all the "shortcomings" (i.e. the presence of quotation marks in the content) mentioned in this article.

directevolve 6 months ago

How big an issue is CSV format really? I work in bioinformatics where it seems like everything is one odd CSV-like format or another. In Python, I have access to tools like pandas, duckdb, and polars, which have detailed ingestion options and sometimes a sniffer. I can read part of a file and check in seconds if it looks right.

Dealing with the variety of formats certainly isn’t the bottleneck in my productivity. Is it for others? I’d be curious why.

runarberg 6 months ago

I’m working on a PWA which includes a dictionary search[1] feature and only a static web server (so no server side database to optimize the search). I did want searching to work in offline mode anyway. I decided it was best to generate an index file which the users download on first visit. For some reason I found USV[2] to be the best fit for this. USV I think allows separating with ASCII control characters, but I used the Unicode variants (␟, ␞, and ␝).

I really liked this as it allowed me to add the glossary as an array in one of the columns. I wrote the parser myself, which searches through the text structure, and it was simple enough. The reason I opted not to use a CSV or a TSV was that I didn’t want to deal with escaping surprise commas or tabs I would find in the dictionary data, plus the extra dimension was nice. Since the file is generated, I didn’t have to type the characters myself, so it had none of the downsides of this format honestly.

1: https://shodoku.app/dictionary

2: https://github.com/SixArm/usv
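
As a rough illustration of the approach in the comment above, here is a small Python sketch that parses an index delimited by the Unicode control pictures. The record layout (word, reading, glossary), the use of ␝ for list items inside one field, and the names `parse_index` and `search` are all assumptions made for this example; the comment does not describe the real index file, and USV itself may assign ␝ a different role.

```python
UNIT = "\u241f"    # ␟ SYMBOL FOR UNIT SEPARATOR, between fields
RECORD = "\u241e"  # ␞ SYMBOL FOR RECORD SEPARATOR, between entries
GROUP = "\u241d"   # ␝ SYMBOL FOR GROUP SEPARATOR, used here (assumption)
                   # to separate glossary items inside a single field


def parse_index(text):
    """Parse a hypothetical 3-column index: word, reading, glossary list."""
    entries = []
    for record in text.split(RECORD):
        if not record:
            continue  # skip the empty piece after a trailing separator
        word, reading, glossary = record.split(UNIT)
        entries.append(
            {"word": word, "reading": reading, "glossary": glossary.split(GROUP)}
        )
    return entries


def search(entries, term):
    """Return entries whose glossary contains the term as a substring."""
    return [e for e in entries if any(term in g for g in e["glossary"])]


sample = (
    "猫" + UNIT + "ねこ" + UNIT + "cat" + RECORD
    + "犬" + UNIT + "いぬ" + UNIT + "dog" + GROUP + "hound, canine" + RECORD
)
entries = parse_index(sample)
assert search(entries, "hound")[0]["word"] == "犬"
```

The comma inside "hound, canine" needs no quoting, which is the convenience the comment describes for generated files.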

Apreche 6 months ago

The shortcoming of using the control characters is that there is no easy way to type them on a keyboard. You can trivially edit csv in a text editor.

1vuio0pswjnm7 6 months ago

"Then you have a text file format that is trivial to write out and read in, with no restrictions on the text in fields or the need to try and escape characters."

Not being a "developer", I have been productively using these non-printing separators for personal use as a UNIX-like OS and text-only internet user for close to three decades. Of course I have a bias for ASCII and against Unicode and I only use the English language for computing. Perhaps this is why using the ASCII characters, including the record and file separators, works so well for me.

Using ASCII non-printing separators might not work for everybody but it would be false to assume it will not work for anybody.

Historically ASCII worked for some computer users. It still does today, for those who still use it like myself.

The author states, "The most anoying[sic] thing about the whole problem is that it was solved by design in the ASCII character set."

"Developers" might not use the ASCII solution but that does not prevent other computer owners from using it.

zaxomi 6 months ago

I sometimes use them for machine to machine transfer. The biggest problem is that regular editors don't handle it in a sensible way.

croes 6 months ago

CSV isn't that complicated if done right.

1. If a value includes the line separator, row separator or text qualifier, surround the value with the text qualifier.

2. If the value contains the text qualifier, double it in the value.
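
The two rules above can be sketched in a few lines of Python. This is an illustration, not a full CSV writer: it assumes a comma as the field delimiter and a double quote as the text qualifier, and it reads the comment's "line separator" as the field delimiter, which is an interpretation. The helper names are hypothetical. The standard library's `csv.reader` is used only to check that the output parses back.

```python
import csv
import io


def quote_field(value, delimiter=",", qualifier='"'):
    """Apply the two rules: wrap when needed, double embedded qualifiers."""
    needs_quoting = any(ch in value for ch in (delimiter, qualifier, "\n", "\r"))
    if not needs_quoting:
        return value
    return qualifier + value.replace(qualifier, qualifier * 2) + qualifier


def write_row(fields):
    """Join already-quoted fields with the (assumed) comma delimiter."""
    return ",".join(quote_field(f) for f in fields)


def parse_row(line):
    """Parse a single row with the standard csv module."""
    return next(csv.reader(io.StringIO(line, newline="")))


fields = ["plain", "a,b", 'say "hi"', "two\nlines"]
line = write_row(fields)
assert line == 'plain,"a,b","say ""hi""","two\nlines"'
assert parse_row(line) == fields
```

Writing is the easy half; a reader that honors these rules has to track whether it is inside a qualified value, so a plain split on commas is not enough.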

theandrewbailey 6 months ago

I've used these when I've had some code with thousands of strings. I concatenated them with the ASCII separators in the source code, then called String.split as needed. The speedup was noticeable, probably since the runtime choked on instantiating so many strings at one time when launched.

bradley13 6 months ago

Nice idea, but as others have pointed out, non-printable characters pose their own problems. People expect to be able to edit CSV files.

Someone mentioned XML, but for most use cases XML is stupidly over-engineered. JSON is simpler - the entire specification is just a dozen or so pages.

robsh 6 months ago

All we need is native Excel support, and HTML5 web support. In web browsers it should be the default copy formatting, and if you’re writing an HTML document these characters should be an alternative to using TD and TR tags.

jiehong 6 months ago

Perhaps we should someday have length delimited text formats, and editors should recalculate the length on the fly.

Something like:

5:hello
2:pi

Maybe with one blank line with no delimiter as a record separator.

All fields on the same line could work, and would be more greppable, but harder to read for humans.
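
A minimal Python sketch of the same-line variant of this idea, where each field is written as a decimal length, a colon, and the text. Counting the length in characters rather than bytes is an assumption, and `encode_fields` and `decode_fields` are hypothetical names; the comment does not pin down either detail.

```python
def encode_fields(fields):
    """Write each field as '<length>:<text>', with length in characters."""
    return "".join(f"{len(field)}:{field}" for field in fields)


def decode_fields(text):
    """Read length-prefixed fields back; no character needs escaping."""
    fields = []
    pos = 0
    while pos < len(text):
        colon = text.index(":", pos)     # end of the length prefix
        length = int(text[pos:colon])
        start = colon + 1
        end = start + length
        if end > len(text):
            raise ValueError("truncated field")
        fields.append(text[start:end])
        pos = end                        # next field starts right after
    return fields


assert encode_fields(["hello", "pi"]) == "5:hello2:pi"
tricky = ["a:b", "1:2", "", "line\nbreak"]
assert decode_fields(encode_fields(tricky)) == tricky
```

Because the reader skips over exactly `length` characters, colons, digits, and newlines inside a field are harmless. The cost is the one the comment anticipates: any hand edit has to update the length prefix too.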

chrishill89 6 months ago

You can use ASCII-separated values in qsv.[1]

For the unlikely event that you are dealing with data with the metacharacters: qsv will use some other control character as the “quote” character to deal with that.

calibas 6 months ago

I think this would catch on much more quickly if text editors treated the Record Separator character as a new line, and there was a special character for the Unit Separator.

mannyv 6 months ago

Tabs and commas are ASCII characters, so a CSV file and a TDF are ASCII-delimited by definition.

This lack of precision in writing is annoying.

tpoacher 6 months ago

people saying \034 / \035 are not readable / printable so they don't make good human readable delimiters: make it ,\034 and \n\035. looks like csv, but is actually ascii delimited. just remove last character from all entries.
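
A small Python sketch of this hybrid scheme, assuming the field delimiter is a comma followed by \034 and the record delimiter is a newline followed by \035, exactly as written in the comment. Splitting on the two-character pairs is used here in place of "remove last character from all entries"; it is a simplification chosen for the example, and the function names are hypothetical.

```python
FIELD_SEP = ",\x1c"    # visible comma + octal 034 (ASCII 28)
RECORD_SEP = "\n\x1d"  # visible newline + octal 035 (ASCII 29)


def encode(rows):
    """Join fields and records with the visible+invisible delimiter pairs."""
    return RECORD_SEP.join(FIELD_SEP.join(row) for row in rows)


def decode(text):
    """Split on the pairs, so bare commas and newlines in data are left alone."""
    if text == "":
        return []
    return [record.split(FIELD_SEP) for record in text.split(RECORD_SEP)]


simple = encode([["a", "b"], ["c", "d"]])
# With the control characters stripped, what remains looks like plain CSV.
assert simple.replace("\x1c", "").replace("\x1d", "") == "a,b\nc,d"

tricky = [["x,y", "two\nlines"], ['q"', ""]]
assert decode(encode(tricky)) == tricky
```

Data that already contains one of the two-character pairs would still split in the wrong place, so the scheme narrows the escaping problem without removing it.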

apitman 6 months ago

Would love to see an explanation and some examples of what this would look like to work with for common use cases.

gabrielsroka 6 months ago

2009. has been shared here many times before

ribcage 6 months ago

plaintext is obsolete. Only good for storing passwords.