This looks like a very nice implementation of a graph database. However, a 6-machine cluster barely qualifies as "distributed" for the purposes of a graph database. You will experience almost no sub-linearity at this scale no matter how poorly the algorithms distribute. I am not intimately familiar with the transactional implementation, but the graph processing algorithms described in Titan are not significantly distributable.

For graph queries, linearity can definitely be achieved on parallel systems with *tens of thousands* of compute nodes using the state-of-the-art algorithms. However, Titan does not use any of those algorithms and will be very lucky to exhibit linearity at a couple dozen nodes. Not to take away from the implementation, but people looking for a scalable graph database will not be satisfied with Titan.

As an aside, the best algorithms and data structures for doing massively parallel graph operations are not published but obviously exist. The fastest system in the Graph500 benchmark uses half a million cores on a trillion-edge graph. That is a gap of several orders of magnitude between what the best open source systems can do and what systems developed by closed research organizations can do, as published in public benchmarks.

(Disclosure: I invented one of the aforementioned families of massively parallel graph processing algorithms in 2009, and it was not the first. The published literature has not even caught up with the first generation of such algorithms. A significant portion of advanced computer science research is no longer published.)
Great article.

I'd like to see a few other details that aren't mentioned:

- What does the following distribution end up looking like? Does it have a similar fraction of 'celebrity' users with huge follower counts? Or more technically, does the russian-roulette step applied to the recommendation sampling end up producing a network similar to a scale-free graph grown via preferential attachment (Barabási–Albert model)? It looks like your mean fanout is about 20, which is smaller than what Twitter has published, but I'd be more interested in knowing how many 10k+ follower users are in the graph. (Rough sketch of both questions below.)

- What's the write amplification like? ~1.6 billion edges (one per tweet per follower) stored daily seems like it could burn a lot of capacity quickly, though most of it will grow cold quickly and could be pushed to archive. Making a rough guess from your disk-write monitoring line graph, it looks like you'd be putting down about 16GB a day? It'd be interesting to see a comparison between this run and one done where streams are built indirectly via follower links alone.
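Here's a rough, self-contained sketch of both questions. None of the numbers come from the article: N, the fan-out m, and the bytes-per-edge figure are illustrative guesses, and the generator is plain Barabási–Albert preferential attachment rather than whatever the article's recommendation-sampling scheme actually does.

    import random
    from collections import Counter

    def preferential_attachment_followers(n, m, seed=42):
        """Grow an n-node graph where each new node follows up to m existing
        nodes chosen with probability proportional to their degree.
        Returns a Counter of follower counts (in-degree)."""
        rng = random.Random(seed)
        followers = Counter()
        # One list entry per edge endpoint, so a uniform draw from this list
        # picks a node with probability proportional to its total degree.
        endpoints = list(range(m))            # seed nodes
        for new_node in range(m, n):
            targets = {rng.choice(endpoints) for _ in range(m)}
            for t in targets:
                followers[t] += 1             # t gains a follower
                endpoints.extend((t, new_node))
        return followers

    if __name__ == "__main__":
        N, M = 200_000, 20                    # mean fan-out ~20, as estimated above
        followers = preferential_attachment_followers(N, M)
        print("users with > 1,000 followers: ",
              sum(1 for c in followers.values() if c > 1_000))
        print("users with > 10,000 followers:",
              sum(1 for c in followers.values() if c > 10_000))

        # Back-of-envelope write volume: ~1.6e9 edges/day at an assumed 10 bytes/edge
        EDGES_PER_DAY, BYTES_PER_EDGE = 1.6e9, 10
        print("approx. GB written per day:   ",
              EDGES_PER_DAY * BYTES_PER_EDGE / 1e9)

Counting the heavy tail this way would at least show whether the recommendation-driven network comes out noticeably flatter than a pure preferential-attachment one.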
Right now Titan traversals run locally and communicate with the back-end Cassandra or HBase datastore via Thrift, but they are working on moving the traversal code inside the Cassandra/HBase JVM so you can run Gremlin traversals inside the distributed cluster rather than pulling the elements over the wire.

See https://groups.google.com/d/msg/gremlin-users/6GYiHn3QR8g/81TFiW7yoTsJ
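To make the "pulling elements over the wire" point concrete, here's a toy sketch (plain Python, not Titan or Gremlin APIs; RemoteStore is a made-up stand-in for the Cassandra/HBase adjacency storage) that just counts simulated round trips for a two-hop traversal done client-side versus server-side:

    class RemoteStore:
        """Pretend remote adjacency store; counts simulated network round trips."""
        def __init__(self, adjacency):
            self.adjacency = adjacency
            self.round_trips = 0

        def neighbours(self, vertex):
            self.round_trips += 1                 # one Thrift-style call per lookup
            return self.adjacency.get(vertex, [])

        def two_hop_server_side(self, vertex):
            self.round_trips += 1                 # single call, work done "in the cluster"
            out = set()
            for n in self.adjacency.get(vertex, []):
                out.update(self.adjacency.get(n, []))
            return out

    def two_hop_client_side(store, vertex):
        """Traversal run locally: every expansion pulls an adjacency list over the wire."""
        out = set()
        for n in store.neighbours(vertex):
            out.update(store.neighbours(n))
        return out

    if __name__ == "__main__":
        adjacency = {"a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"]}
        store = RemoteStore(adjacency)
        print(two_hop_client_side(store, "a"), "round trips:", store.round_trips)

        store.round_trips = 0
        print(store.two_hop_server_side("a"), "round trips:", store.round_trips)

The client-side version pays one round trip per vertex it expands, which is exactly the cost that pushing the traversal into the storage JVM is meant to eliminate.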
Cassandra is turning out to be one killer data backend. It's really exciting to see what's being built on top of it.

Hadoop, Solr/Lucene, and now Blueprints/graph DB operations are all available on the same Cassandra cluster, in addition to the stuff Cassandra does quote-unquote natively. Add ZooKeeper for the few times you need an honest-to-goodness transaction, and it's just crazy how good the tech has gotten on the backend in the last 10 years. :-)
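For the ZooKeeper bit, a minimal sketch of the usual distributed-lock recipe using the kazoo Python client; the host, lock path, and identifier are assumptions for illustration, and the Cassandra read-modify-write is left as a comment:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ZooKeeper ensemble
    zk.start()

    # A distributed lock lets exactly one client perform a read-modify-write
    # that plain Cassandra writes alone don't make atomic.
    lock = zk.Lock("/locks/account-42", "worker-1")
    with lock:
        # ... read the row from Cassandra, update it, write it back ...
        pass

    zk.stop()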