To think about the difference between serialization formats, here's an analogy I hope will help.<p>Protocol Buffers (and I think Thrift, and maybe Avro) are sort of like C or C++: you declare your types ahead of time, and then you take some binary payload and "cast" it (parse it, actually) into your predefined type. If those bytes weren't actually serialized as that type, you'll get garbage. On the plus side, declaring your types statically means you get lots of useful compile-time checking and everything is really efficient. It's also nice because you can use the schema file (i.e. .proto files) to declare your schema formally and document everything.<p>JSON and Ion are more like a Python/JavaScript object/dict. Objects are just attribute-value bags. If you say it has field fooBar at runtime, now it does! When you parse, you don't have to know what message type you're expecting, because the key names are all encoded on the wire. On the downside, if you misspell a key name, nothing is going to warn you about it. And things aren't quite as efficient, because the general representation has to be a hash map where every value is dynamically typed. On the plus side, you never have to worry about losing your schema file.<p>I think this is a case where "strongly typed" isn't the clearest way to frame it; "statically typed" vs. "dynamically typed" is the useful distinction.
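<p>A rough sketch of the contrast in Java, if it helps (the protobuf Person class here is hypothetical, and the ion-java calls are my reading of that library's API, so treat the details as approximate):<p><pre><code>import software.amazon.ion.IonStruct;
import software.amazon.ion.IonSystem;
import software.amazon.ion.IonValue;
import software.amazon.ion.system.IonSystemBuilder;

class StaticVsDynamic {
    static void demo() {
        // Static style (protobuf-ish): the type is declared up front,
        // so a misspelled field name is a compile error.
        //   Person p = Person.parseFrom(bytes);  // hypothetical generated class
        //   String name = p.getName();

        // Dynamic style (JSON/Ion): fields are looked up by name at runtime.
        IonSystem ion = IonSystemBuilder.standard().build();
        IonStruct person = (IonStruct) ion.singleValue("{name:\"Ada\", age:36}");
        IonValue name = person.get("name");   // found
        IonValue oops = person.get("nmae");   // typo: silently returns null
    }
}
</code></pre>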
Finally! I've had to live the JSON nightmare since I left Amazon.<p>Some of the benefits over JSON:<p>* Real timestamp type<p>* Real binary type - no need to base64 encode<p>* Real decimal type - invaluable when working with currency<p>* Annotations - you can tag any Ion value with an annotation that says, e.g., how it's encoded or compressed ("csv", "snappy") or its serialized type ('com.example.Foo')<p>* Text and binary formats<p>* Symbol tables - this is like automated jsonpack<p>* It's self-describing - meaning, unlike Avro, you don't need the schema ahead of time to read or write the data.
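<p>To make that concrete, here's a rough sketch of what those types look like going through ion-java; the field names and values are invented, and the calls are my best reading of the library's public API:<p><pre><code>import software.amazon.ion.IonDecimal;
import software.amazon.ion.IonStruct;
import software.amazon.ion.IonSystem;
import software.amazon.ion.IonTimestamp;
import software.amazon.ion.system.IonSystemBuilder;

class IonTypesDemo {
    static void demo() {
        IonSystem ion = IonSystemBuilder.standard().build();
        // Ion text with a timestamp, an exact decimal, a blob, and an annotated value:
        IonStruct order = (IonStruct) ion.singleValue(
            "{ created: 2016-04-14T12:00:00Z,"    // timestamp literal
          + "  total: 19.99,"                     // decimal (exact, not a float)
          + "  payload: {{ aGVsbG8= }},"          // blob (base64 only in the text form)
          + "  body: snappy::{{ aGVsbG8= }} }");  // annotation tagging a value
        IonTimestamp created = (IonTimestamp) order.get("created");
        IonDecimal total = (IonDecimal) order.get("total");
    }
}
</code></pre>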
I Consider this Harmful (TM) and will oppose its adoption in every organization where I have an opportunity to voice such. (In its present form, to be clear!)<p>There is no need to have a null which is fragmented into null.timestamp, null.string and whatever. It will complicate processing: even when you know the type of some element is timestamp, you must still worry about whether it is null and what that means.<p>There should be just one null value, which is its own type. A given datum is either permitted to be null OR something else like a string, or it isn't: it is expected to be a string, which is distinct from the null value, since no string is a null value.<p>It's good to have a read notation for a timestamp, but it's not an elementary type; a timestamp is clearly an aggregate and should be understood as corresponding to some structure type. A timestamp should be expressible using that structure, not only as a special token.<p>This monstrosity is not an example of good typing; it is not good static typing, and not good dynamic typing either. Under static typing we can have some "maybe" type instead of null.string: in some places we definitely have a string, and in other places we have a "maybe string", a derived type which admits the possibility that a string is there or isn't. Under dynamic typing, we can superimpose objects of different types in the same places; we don't need a null version of string since we can have "the" one and only null object there.<p>This looks like it was invented by people who live and breathe Java and do not know any other way of structuring data. Java uses statically typed references to dynamic objects, and each such reference type has a null in its domain so that "object not there" can be represented. But just because you're working on a reference implementation in such a language doesn't mean you cannot <i>transcend</i> the semantics of the implementation language. If you want to propose some broad interoperability standard, you practically <i>must</i>.
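<p>For readers who haven't seen it, a small sketch of the behavior I'm objecting to, using the ion-java calls mentioned elsewhere in this thread (the printed results are my expectation of the API, not something I've verified):<p><pre><code>import software.amazon.ion.IonSystem;
import software.amazon.ion.IonValue;
import software.amazon.ion.system.IonSystemBuilder;

class TypedNulls {
    static void demo() {
        IonSystem ion = IonSystemBuilder.standard().build();
        IonValue a = ion.singleValue("null.string");
        IonValue b = ion.singleValue("null.timestamp");
        IonValue c = ion.singleValue("null");

        System.out.println(a.getType());      // STRING    -- a string-shaped null
        System.out.println(b.getType());      // TIMESTAMP -- a timestamp-shaped null
        System.out.println(c.getType());      // NULL      -- the untyped null
        System.out.println(a.isNullValue());  // true for all three
    }
}
</code></pre>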
This reminds me a lot of Avro:<p><a href="https://avro.apache.org/docs/current/" rel="nofollow">https://avro.apache.org/docs/current/</a><p>They both have self-describing schemas, support for binary values, JSON-interoperability, basic type systems (Ion seems to support a few more field types), field annotations, support for schema evolution, code generation not necessary, etc.<p>I think Avro has the additional advantages of being production-tested in many different companies, a fully-JSON schema, support for many languages, RPC baked into the spec, and solid performance numbers found across the web.<p>I can't really see why I'd prefer Ion. It looks like an excellent piece of software with plenty of tests, no doubt, but I think I could do without "clobs", "sexprs", and "symbols" at this level of representation, and it might actually be better if I do. Am I missing something?
Big congrats to Todd, Almann, Chris, Henry, and everyone else who made this happen.<p>Several years ago, I wouldn't have imagined this possible and I'm a little bummed that I left before it happened.<p>Like leef said above, I'm glad to have Ion as an option again.
Interestingly enough a JSON alternative named "ION" was just posted as a Show HN[0] about three months ago.<p>So now not only do we have the problem of redundant and mutually incompatible protocols (cue obligatory xkcd), but that we have <i>so many</i> such protocols that name collision is becoming an extra problem.<p>[0] <a href="https://news.ycombinator.com/item?id=11027319" rel="nofollow">https://news.ycombinator.com/item?id=11027319</a>
Binary values can be stored as base64 in regular old JSON as well. Yes, that is bigger, but email/MIME handles binary chunks the same way; email messages and attachments are converted to base64 every day. Base64 bloats payloads by roughly a third, so larger content can be compressed before base64 encoding (and decompressed after decoding), or even encrypted/decrypted on either end in the software/app layer.<p>There's no need for a new protocol when doing it that way for basic things; if you need more binary (busy messaging/real-time), there are plenty of alternatives to JSON.<p>I love the simplicity of JSON, and so do others; it is successful, so many try to attach themselves to that success. The success came from it being so damn simple, though. Most attachments just complicate and add verbosity, which echoes back to the XML and SOAP wars that spawned the plain and simple JSON. Adding complexity is easy and anyone can do it; good engineers take complexity and make it simple, and that is damn difficult.
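<p>A rough sketch of that compress-then-encode approach using only the Java standard library (names are made up; the roughly-a-third figure is the raw base64 expansion of 4 output chars per 3 input bytes, before any MIME line breaks):<p><pre><code>import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

class JsonBinaryPacking {
    // Compress first, then base64-encode for embedding in a JSON string field.
    static String packForJson(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        }
        // Base64 adds about a third, so compressing first usually offsets
        // the encoding overhead on compressible data.
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }
}
</code></pre>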
I can't decide if "JSON-superset" is technically accurate or not.<p>JSON's string literals come from JavaScript, and JavaScript only sort of has a Unicode string type. So the \u escape in both languages encodes a UTF-16 code <i>unit</i>, not a code <i>point</i>. That means in JSON, the single code point U+1F4A9 "Pile of Poo" is encoded thusly:<p><pre><code> "\ud83d\udca9"
</code></pre>
JSON specifically says this, too,<p><pre><code> Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
[… snip …]
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
</code></pre>
Now, Ion's spec says only:<p><pre><code> U+HHHH \uHHHH 4-digit hexadecimal Unicode code point
</code></pre>
But if we take it to mean code <i>point</i>, then if the value is a surrogate… what should happen?<p>Looking at the code, it <i>looks</i> like the above JSON will parse:<p><pre><code> 1. Main parsing of \u here:
https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L2429-L2434
2. which is called from here, and just appended to a StringBuilder:
https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L1975
</code></pre>
My Java isn't that great though, so I'm speculating. But I'm not sure what <i>should</i> happen.<p>This is just one of those things that the first time I saw it in JSON/JS… a part of my brain melted. This is all a technicality, of course, and most JSON values should work just fine.
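<p>For what it's worth, here's a small Java sketch of why appending those two code units to a StringBuilder still round-trips to the right code point; this is about Java's UTF-16 strings, not about what the Ion spec intends:<p><pre><code>class SurrogatePairDemo {
    public static void main(String[] args) {
        // Java strings are sequences of UTF-16 code units, so appending the
        // two escapes from the JSON above yields a single code point.
        StringBuilder sb = new StringBuilder();
        sb.append((char) 0xD83D);   // high surrogate, from "\ud83d"
        sb.append((char) 0xDCA9);   // low surrogate,  from "\udca9"
        String s = sb.toString();

        System.out.println(s.length());                         // 2 code units
        System.out.println(s.codePointCount(0, s.length()));    // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f4a9
    }
}
</code></pre>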
Is there a source for benchmarks/reviews for the various ways to represent data? As far as I see it, there are a lot of them that I'd like to hear pros/cons for: json, edn + transit (my fave), yaml, google protobufs, thrift (?), as well as Ion.<p>And where does Ion fit here?
Wasn't this solved already by the BSON specification - <a href="http://bsonspec.org" rel="nofollow">http://bsonspec.org</a> ? Sure, Ion gives you a definition of types, but that could easily be done using standard JSON metadata for each field. I find BSON simpler and more elegant.
> Decimal maintains precision: -0. != -0.0<p>What? This means their "arbitrary-precision decimals" are actually isomorphic to (Rational x Natural).
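<p>For comparison, java.math.BigDecimal has the same precision-preserving behavior via its scale; a small sketch (the mapping to Ion decimals is my own reading, not the spec's wording):<p><pre><code>import java.math.BigDecimal;

class DecimalPrecision {
    public static void main(String[] args) {
        // BigDecimal keeps the scale (count of fractional digits), so equals()
        // separates values that are numerically identical -- the same idea
        // behind Ion's -0. != -0.0. (BigDecimal drops the sign of a zero,
        // which is presumably why ion-java carries its own Decimal class.)
        BigDecimal a = new BigDecimal("0.0");    // unscaled value 0, scale 1
        BigDecimal b = new BigDecimal("0.00");   // unscaled value 0, scale 2
        System.out.println(a.equals(b));     // false: different precision
        System.out.println(a.compareTo(b));  // 0: same numeric value
    }
}
</code></pre>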
Do any of the popular message serialization formats have first class support for algebraic data types? It seems like every one I've researched has to be hacked in some way to provide for sum types.
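<p>The usual workaround, rather than first-class support, is to abuse a tagging mechanism as the constructor name; here's a hedged sketch using Ion annotations via ion-java (the types and field names are invented, and the calls are my reading of the API):<p><pre><code>import software.amazon.ion.IonDecimal;
import software.amazon.ion.IonStruct;
import software.amazon.ion.IonSystem;
import software.amazon.ion.IonValue;
import software.amazon.ion.system.IonSystemBuilder;

class SumTypeSketch {
    // Encode `Shape = Circle {radius} | Rect {w, h}` by tagging each struct
    // with its constructor name as an annotation.
    static void handle(IonValue shape) {
        String[] tags = shape.getTypeAnnotations();
        String tag = tags.length > 0 ? tags[0] : "";
        if (tag.equals("Circle")) {
            IonDecimal radius = (IonDecimal) ((IonStruct) shape).get("radius");
            // ... handle the Circle case
        } else if (tag.equals("Rect")) {
            // ... handle the Rect case
        }
        // Nothing enforces exactly one tag, or that the fields match the
        // constructor -- which is the "hacked in" part.
    }

    static void demo() {
        IonSystem ion = IonSystemBuilder.standard().build();
        handle(ion.singleValue("Circle::{radius: 2.5}"));
    }
}
</code></pre>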
Would like to see a comparison to EDN. <a href="https://github.com/edn-format/edn" rel="nofollow">https://github.com/edn-format/edn</a>
Almost every time I see yet another structured data format I'm surprised at the number of people who haven't ever heard of ASN.1, despite it forming the basis of <i>many</i> protocols in widespread use.
A question for frontend devs: Will H2 being binary on the wire inspire more use of binary data representations as well, with conversion to JSON only on the client? Passing around JSON or XML across a big SOA (or micro-services) architecture is a waste of cycles and doesn't have types attached for reliability and security.
This appears to be something in between JSON and Protocol Buffers. I wonder under what conditions Ion makes more sense than either JSON or protobuf.
So far, most of the interesting bits I see in Ion are covered by YAML (which is also a JSON superset). Most of the rest are extra types, which YAML allows you to implement. The only really missing bit is the binary encoding... but that seems unrelated to the text format itself.<p>This really looks like a NIH specification.
Open question to anyone reading this: Would you use Ion if you were designing a new house-wide message queue? (e.g. broadcast messages to /Home/Lounge/Lights/ to turn on/off)
Things I dislike about Ion, having used it while at Amazon:<p>- IonValues are mutable by default. I saw bugs where cached IonValues were accidentally changed, which is easy to do: IonSequence.extract clears the sequence [1], adding an IonValue to a container mutates the value (!) [2], etc.<p>- IonValues are not thread-safe [3]. You can call makeReadOnly() to make them immutable, but then you'll be calling clone since doing anything useful (like adding it to a list) will need to mutate the value. While it says IonValues are not even thread-safe for reading, I believe this is not strictly true. There was an internal implementation that would lazily materialize values on read, but it doesn't look like it's included in the open source version.<p>- IonStruct can have multiple fields with the same name, which means it can't implement Map. I've never seen anyone use this (mis)feature in practice, and I don't know where it would be useful.<p>- Since IonStruct can't implement Map, you don't get the Java 8 default methods like forEach, getOrDefault, etc.<p>- IonStruct doesn't implement keySet, values, spliterator, or stream, and thus doesn't play well with the Java 8 Stream API.<p>- Calling get(fieldName) on an IonStruct returns null if the field isn't present. But the value might also be there and be null, so you end up having to do a null check AND call isNullValue(). I'm not convinced it's a worthwhile distinction, and would have preferred a single way of doing it. You can already call containsKey to check for the presence of a field.<p>- In practice most code that dealt with Ion was nearly as tedious and verbose as pulling values out of an old-school JSONObject. Every project seemed to have a slightly different IonUtils class for doing mundane things like pulling values out of structs, doing all the null checks, casting, etc. There was some kind of adapter for Jackson that would allow you to deserialize to a POJO, but it didn't seem like it was widely used.<p>[1] <a href="https://github.com/amznlabs/ion-java/blob/master/src/software/amazon/ion/IonSequence.java#L457" rel="nofollow">https://github.com/amznlabs/ion-java/blob/master/src/softwar...</a><p>[2] <a href="https://github.com/amznlabs/ion-java/blob/master/src/software/amazon/ion/IonValue.java#L103-L112" rel="nofollow">https://github.com/amznlabs/ion-java/blob/master/src/softwar...</a><p>[3] <a href="https://github.com/amznlabs/ion-java/blob/master/src/software/amazon/ion/IonValue.java#L119-L140" rel="nofollow">https://github.com/amznlabs/ion-java/blob/master/src/softwar...</a>
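<p>To make the get(fieldName) complaint concrete, here's roughly what every one of those IonUtils classes ended up containing (a sketch from memory, not actual Amazon code):<p><pre><code>import software.amazon.ion.IonString;
import software.amazon.ion.IonStruct;
import software.amazon.ion.IonValue;

class IonUtilsSketch {
    // The dance required to safely read an optional string field:
    // get() returns null when the field is absent, but the field may also
    // be present with value null.string, so both checks are needed.
    static String optionalString(IonStruct struct, String fieldName) {
        IonValue v = struct.get(fieldName);
        if (v == null || v.isNullValue()) {
            return null;
        }
        return ((IonString) v).stringValue();
    }
}
</code></pre>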
I use this: <a href="http://dataprotocols.org/tabular-data-package/" rel="nofollow">http://dataprotocols.org/tabular-data-package/</a>