Protocol Buffers, Avro, Thrift & MessagePack

135 points by igrigorik almost 14 years ago

11 comments

hesdeadjim almost 14 years ago
I work at an indie game shop and I have pushed us to use it (almost) everywhere we need to define a file format. Originally we were using JSON for everything, which is great for the quick-and-dirty approach, but as our code base (primarily C++) has grown I absolutely love the guarantees I get with protobufs:

- Strongly typed: no boilerplate error checking in case someone set my "foo" field to an integer instead of a string.

- Easy to version and upgrade: just create new fields, deprecate the old ones, and move on with life.

- Protobuf IDLs are both the documentation and the implementation of my file format: no docs to write about which fields belong in which object, and no out-of-sync documentation and code.

- Reflection support: I don't use this a lot, but when I need it, it's awesome.

- A variety of storage options. For instance, the level editor I wrote recently uses the human-readable text format when it saves out levels, but when I'm ready to ship, I can trivially convert those level files to binary and immediately improve the performance of my app.

- Tons of language bindings. Our engine code base is C++, but any build scripts I write are done in Python, and if a script needs to touch protobuf files I don't have to rewrite my file-parsing routines; it just works.

I looked into using Apache Thrift as well, but its text-based format is not human-readable, so it was a non-starter for us.
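A minimal sketch of the text-format-to-binary conversion described in the comment above, using the official Python protobuf bindings; the `level_pb2` module and `Level` message are hypothetical stand-ins for the poster's schema:

    # Sketch: load a level saved in protobuf text format and re-emit it as
    # compact binary for shipping. level_pb2 and Level are invented names.
    from google.protobuf import text_format
    import level_pb2

    def text_to_binary(text_path: str, bin_path: str) -> None:
        level = level_pb2.Level()
        with open(text_path) as f:
            text_format.Parse(f.read(), level)   # human-readable editor output
        with open(bin_path, "wb") as f:
            f.write(level.SerializeToString())   # compact binary wire format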
jleader almost 14 years ago
There's a way to encode a protobuf schema in a protobuf message, making it possible to send self-describing messages (i.e. include a serialized schema before each message). I'm not sure if anyone actually does this. See http://code.google.com/apis/protocolbuffers/docs/techniques.html#self-description for details.
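A rough Python sketch of the self-description idea; the length-prefixed framing and the shortcut of shipping only the message's own file descriptor (ignoring any imported .proto dependencies) are simplifying assumptions, not taken from the linked doc:

    # Sketch: prepend a serialized FileDescriptorSet (the schema) to the
    # message bytes so a receiver can decode without precompiled classes.
    from google.protobuf import descriptor_pb2

    def self_describing_payload(msg):
        fds = descriptor_pb2.FileDescriptorSet()
        # serialized_pb is the FileDescriptorProto of the message's .proto file
        fds.file.add().MergeFromString(msg.DESCRIPTOR.file.serialized_pb)
        schema = fds.SerializeToString()
        body = msg.SerializeToString()
        # naive framing: [schema length][schema][body length][body]
        return (len(schema).to_bytes(4, "big") + schema +
                len(body).to_bytes(4, "big") + body)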
mikeklaas almost 14 years ago
> (Word of warning: historically, Thrift has not been consistent in their feature support and performance across all the languages, so do some research).

Conversely, we chose Thrift over protobuf for this reason. Protobuf's Python performance was *abysmal*: over 10x worse than Thrift.
haberman almost 14 years ago
> Hence it should not be surprising that PB is strongly typed, has a separate schema file, and also requires a compilation step to output the language-specific boilerplate to read and serialize messages.

I've spent the last two years working on a Protocol Buffer implementation that does *not* have these limitations. Using my implementation upb (https://github.com/haberman/upb/wiki) you can import schema definitions at runtime (or even define your own schema using a convenient API *instead* of writing a .proto file), with no loss of efficiency compared to pre-compiled solutions. (My implementation isn't quite usable yet, but I'm working on a Lua extension as we speak.)

I'm also working on making it easy to parse JSON into protocol buffer data structures, so that in cases where JSON is *de facto* typed (which I think is quite often the case) you can use Protocol Buffers as your data-analysis platform even if your on-the-wire data is JSON. The benefits are greater efficiency (protobufs can be stored as structs instead of hash tables) and convenient type checking / schema validation.

This question of how to represent and serialize data in an interoperable way is a path that began with XML, evolved to JSON, and IMO will converge on a mix of JSON and Protocol Buffers. Having a schema is useful: you get evidence of this from the fact that every format eventually develops a schema language to go along with it (XML Schema, JSON Schema). Protocol Buffers hit a sweet spot between simplicity and capability with their schema definition.

Once you have defined a data model and a schema language, serialization formats are commodities. Protocol Buffer binary format and JSON just happen to be two formats that can both serialize trees of data that conform to a .proto file. On the wire they have different advantages and disadvantages (size, parsing speed, human readability), but once you've decoded them into data structures the differences between them can disappear.

If you take this idea even farther, you can consider column-striped databases to be just another serialization format for the same data. For example, the Dremel database described by Google's paper (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf) also uses Protocol Buffers as its native schema, so your core analysis code could be written to iterate over either row-major logfiles or a column-major database like Dremel *without having to know the difference*, because in both cases you're just dealing with Protocol Buffer objects.

I think this is an extremely powerful idea, and it is the reason I have put so much work into upb. To take this one step further, I think that Protocol Buffers also represent parse trees very well: you can think of a domain-specific language as a human-friendly serialization of the parse tree for that DSL. You can really nicely model text-based protocols like HTTP as Protocol Buffer schemas:

    message HTTPRequest {
      enum Action {
        GET = 0;
        POST = 1;
        // ...
      }
      optional Action action = 1;
      optional string url = 2;

      message Header {
        optional string name = 1;
        optional string value = 2;
      }
      repeated Header header = 3;
      // ...
    }

Everything is just trees of data structures. Protocol Buffers are just a convenient way of specifying a schema for those data structures. Parsers for both binary and text formats are just ways of turning a stream of bytes into trees of structured data.
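The "two serializations of the same tree" point is easy to demonstrate with the json_format helper that ships with the Python protobuf library; the `http_pb2` module for the HTTPRequest schema above is an assumed compiled module, not something from the comment:

    # Sketch: the same HTTPRequest tree, decoded from JSON and re-encoded as
    # binary protobuf. http_pb2 is a hypothetical module built from the schema.
    from google.protobuf import json_format
    import http_pb2

    req = json_format.Parse('{"action": "GET", "url": "/index.html"}',
                            http_pb2.HTTPRequest())
    wire = req.SerializeToString()           # same tree, binary wire format
    back = json_format.MessageToJson(req)    # and back to human-readable JSON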
zoowar almost 14 years ago
This article provides a nice discussion on "protocol buffer misfeatures": http://blog.golang.org/2011/03/gobs-of-data.html
angstrom almost 14 years ago
Having used Thrift and PB, I find myself partial to Avro because it's very similar to what I worked with in the past. Key points about Avro:

- The schema is separate from the data. Network bandwidth is expensive; there is no reason to transfer the schema with each request.

- The data format can change, allowing a server to support multiple client versions within a reasonable range, rather than Thrift's model of always adding fields and making them null when no longer necessary.

- It's possible to layer a higher-level abstraction above a lower-level request abstraction to support more complex objects without generating complex serialization.
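A hedged sketch of the schema-separation and version-range points above, using the `avro` Python package's DatumWriter/DatumReader; the `User` record and its fields are invented for illustration:

    # Sketch: Avro payloads carry no schema; the reader supplies the writer's
    # schema separately and resolves it against its own (newer) schema.
    import io
    import avro.schema
    from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

    writer_schema = avro.schema.parse(
        '{"type":"record","name":"User","fields":['
        '{"name":"name","type":"string"}]}')
    reader_schema = avro.schema.parse(   # newer version adds a defaulted field
        '{"type":"record","name":"User","fields":['
        '{"name":"name","type":"string"},'
        '{"name":"email","type":"string","default":""}]}')

    buf = io.BytesIO()
    DatumWriter(writer_schema).write({"name": "ann"}, BinaryEncoder(buf))
    buf.seek(0)

    # Schema resolution: old data read under the new schema, default filled in.
    print(DatumReader(writer_schema, reader_schema).read(BinaryDecoder(buf)))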
ryanpers almost 14 years ago
Avro is interesting, but it is VERY complex. The schema negotiation is something I never felt comfortable with in an RPC context. You want something simple, and therefore bug-free; this is why we like HTTP and JSON.

My vote goes to Thrift, but there are some features of protobuf which are interesting, notably the APIs for dealing with messages you don't know everything about, and the ways to preserve unknown fields when you copy messages from A->B; this is in part how Dapper works.
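A rough illustration of the unknown-field preservation mentioned above, in Python proto2 terms; `old_pb2` and `new_pb2` are hypothetical modules compiled from v1 and v2 of the same .proto:

    # Sketch: a message parsed with an old schema keeps fields it doesn't know
    # about and writes them back out when re-serialized (proto2 behavior).
    import old_pb2   # v1 of the schema: Thing has only `name`
    import new_pb2   # v2 of the schema: Thing adds `extra`

    wire = new_pb2.Thing(name="x", extra="still here").SerializeToString()

    old_msg = old_pb2.Thing()
    old_msg.ParseFromString(wire)        # `extra` lands in the unknown fields
    reencoded = old_msg.SerializeToString()

    print(new_pb2.Thing.FromString(reencoded).extra)  # -> "still here"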
rlpb almost 14 years ago
Not requiring a schema isn't necessarily an advantage. If I'm receiving data from something that isn't in the same security domain, then I *want* a schema.
frsyuki almost 14 years ago
Technically, there are two important differences:

- statically typed vs. dynamically typed

- the type mapping between the language's type system and the serializer's type system (note: these serializers are cross-language)

The most understandable difference is "statically typed" vs. "dynamically typed". It affects how you manage compatibility between data and programs. Statically typed serializers don't store detailed type information about objects in the serialized data, because it is described in source code or an IDL. Dynamically typed serializers store type information alongside the values.

- Statically typed: Protocol Buffers, Thrift, XDR

- Dynamically typed: JSON, Avro, MessagePack, BSON

Generally speaking, statically typed serializers can store objects in fewer bytes, but they can't detect a mismatch between data and IDL. They must trust that the IDL is correct, since the data doesn't include type information. This means statically typed serializers are high-performance, but you must pay close attention to the compatibility of data and programs.

Note that some serializers add their own improvements for these problems. Protocol Buffers store some (not detailed) type information in the data, so they can detect a mismatch between IDL and data. MessagePack stores type information in an efficient format, so its data size can be smaller than Protocol Buffers or Thrift (depending on the data).

Type systems are also an important difference. The following list compares the type systems of Protocol Buffers, Avro and MessagePack:

- Protocol Buffers: int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64, double, float, bool, string, bytes, repeated, message [1]

- Avro: int, long, float, double, boolean, null, bytes, fixed, string, enum, array, map, record [2]

- MessagePack: Integer, Float, Boolean, Nil, Raw, Array, Map (the same as JSON) [3]

Serializers must map these types to and from the language's types to achieve cross-language compatibility. This means that some types supported by your favorite language can't be stored by some serializers, and too many types may cause interoperability problems. For example, Protocol Buffers don't have a map (dictionary) type. Avro doesn't distinguish unsigned integers from signed integers, while Protocol Buffers do. Avro has an enum type, while Protocol Buffers and MessagePack don't.

This was necessary for their designers: Protocol Buffers were initially designed for C++, Avro for Java, and MessagePack aims for interoperability with JSON.

I'm using MessagePack to develop our new web service; dynamic typing and JSON interoperability are what we need.

[1] http://code.google.com/apis/protocolbuffers/docs/proto.html#scalar

[2] http://avro.apache.org/docs/1.5.1/spec.html#schema_primitive

[3] http://wiki.msgpack.org/display/MSGPACK/Format+specification
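As a small illustration of the dynamic-typing point in the comment above, a sketch using the msgpack-python package; the payload is invented:

    # Sketch: MessagePack tags each value with its type on the wire, so the
    # decoder needs no schema or IDL to reconstruct the object.
    import msgpack

    packed = msgpack.packb({"id": 42, "tags": ["a", "b"]})
    print(len(packed))              # compact: type tags travel with the values
    print(msgpack.unpackb(packed))  # -> {'id': 42, 'tags': ['a', 'b']}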
famousactress almost 14 years ago
Wasn't aware of Avro, but I often find myself wanting something in exactly that sweet spot. Hopefully it'll be a more successful project than Caucho's Hessian has been, which I've found pretty poorly stewarded.
equark almost 14 years ago
How aggressively is the Apache stack moving to Avro?