From what I understand, Manhattan is based on the ideas from ElephantDB. Unfortunately, development on ElephantDB has pretty much stopped, even though it is a critical component of the big data book Nathan Marz is writing: <a href="http://www.manning.com/marz/" rel="nofollow">http://www.manning.com/marz/</a><p>Summingbird (bear with me, I'll tie this in) is also Twitter's answer to writing code once and running it on a variety of execution platforms such as Hadoop, Storm, Spark, Akka, etc. Not all of these have been built out, but the platform was designed as a generic framework to support write-once, execute-everywhere.<p>Summingbird is written to support Manhattan's model as well. The high-level idea is to use versioning to determine whether a request is precomputed (batch), computed (realtime), or a hybrid (precomputed + computed). These are expressed as monoids, with the basic algebraic machinery provided by Algebird. One way to bring this model to the open source world would be to implement Storehaus bindings for ElephantDB, and then either resurrect ElephantDB or build a similar service that provides Manhattan-like storage.<p>Overall, very early yet promising work in the open source community.<p>[edit: book is not about elephantdb, but is a critical component. modified wording. Also added link]
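To make the hybrid idea concrete, here is a minimal sketch in plain Scala (not the actual Summingbird/Algebird API; the Monoid trait and the map-of-counts example are my own illustration). The point is that if batch and realtime results live in the same monoid, serving a key is just merging the precomputed batch view with the realtime delta:

    // Sketch of the batch + realtime "hybrid" read: both sides produce values in the
    // same monoid, so a read is just batchView + realtimeDelta.
    trait Monoid[T] {
      def zero: T
      def plus(a: T, b: T): T
    }

    object HybridRead {
      // Counts per key form a monoid under element-wise addition.
      implicit val countMonoid: Monoid[Map[String, Long]] = new Monoid[Map[String, Long]] {
        def zero = Map.empty
        def plus(a: Map[String, Long], b: Map[String, Long]) =
          (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
      }

      // batchView: last completed batch run; realtimeDelta: events since that batch version.
      def serve[T](batchView: T, realtimeDelta: T)(implicit m: Monoid[T]): T =
        m.plus(batchView, realtimeDelta)

      def main(args: Array[String]): Unit = {
        val batch    = Map("#scala" -> 1000L, "#hadoop" -> 50L) // precomputed (batch)
        val realtime = Map("#scala" -> 7L, "#storm" -> 3L)      // computed since the batch version
        println(serve(batch, realtime)) // Map(#scala -> 1007, #hadoop -> 50, #storm -> 3)
      }
    }

The versioning mentioned above would decide how much of the keyspace comes from the batch side versus the realtime side; the merge itself stays the same.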
I think it's not an exaggeration at all. Twitter literally pulls data from thousands of servers, but we are missing the point of Manhattan. As some of us know, Twitter's services scale dynamically according to the load they are serving, and some engineers at Twitter decided to put that idea on a whole truckload of steroids.<p>Here is the trick: the container agents (storage services) are made "clients" of the Manhattan database (the core). They are Mesos processes that scale dynamically with the needs of the service (i.e. 1 container = 10,000 reads/writes per second, 2 containers = 20,000 reads/writes per second, and so on), which allows dynamic scaling of reads and writes per second. The core handles finding the actual machines that hold the data, replicating it, and so on. There might be realtime storage service containers that need fast data access, the batch importer and time series services they mention, and so on. This requires a lot of gymnastics but offers a lot of nice features. The Manhattan database acts as a virtual layer over thousands of machines, and the storage services allow for customized data manipulation. Cool...huh?
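As a rough back-of-the-envelope sketch of that scaling idea (the 10,000 ops/sec per container figure is just my example number from above, not a real Manhattan figure):

    // Toy sketch of "add containers until we cover the offered load".
    object ContainerScaling {
      val opsPerContainer = 10000L // assumed per-container capacity, illustration only

      // Number of Mesos storage-service containers needed to absorb a target ops/sec.
      def containersFor(targetOpsPerSec: Long): Long =
        math.ceil(targetOpsPerSec.toDouble / opsPerContainer).toLong

      def main(args: Array[String]): Unit = {
        println(containersFor(6000))    // 1  -- roughly the tweets/sec mentioned elsewhere in the thread
        println(containersFor(25000))   // 3
        println(containersFor(1000000)) // 100
      }
    }

The hard part, of course, is not this arithmetic but the core's job underneath: partitioning, replication, and routing across the actual machines.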
Given the importance and the scale (multi-DC) this operates at, I think it's almost impossible to open source this. But who knows, miracles happen now and then.
Interesting. The database sounds almost too good to be true. I wonder if they'll open source this. They've done so in the past with projects like Storm, so I'm hopeful.
Can someone enlighten me as to why 6,000 tweets a second is something to make a big deal about? At 140 characters per message, that comes out to 840,000 bytes/s, which is less than 1 megabyte per second. In 2014, is a service that can handle 1 MB/s impressive?
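For what it's worth, that arithmetic only covers the raw tweet text, ignoring metadata, indexes, fanout, and replication, which is presumably where the real cost is:

    // Back-of-the-envelope: raw tweet text only.
    object TweetThroughput {
      def main(args: Array[String]): Unit = {
        val tweetsPerSec  = 6000
        val bytesPerTweet = 140  // 140 chars, assuming 1 byte each
        val bytesPerSec   = tweetsPerSec * bytesPerTweet
        println(s"$bytesPerSec bytes/s = ${bytesPerSec / 1e6} MB/s") // 840000 bytes/s = 0.84 MB/s
      }
    }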
"Real-time" is a bit of a misnomer as far as databases are concerned, especially if you're talking about a system that defaults to eventual consistency. I think they would have been better off saying "high-availability".
Does anyone else find the opening statement a little misleading? Yes, internally the data may come from thousands of places, but it's all sent from Twitter to the app of your choosing via JSON or similar. Sure, there's going to be more than one request for the icon sprite and user avatars, but all from Twitter.<p>"When you open the Twitter app on your smartphone and all those tweets, links, icons, photos, and videos materialize in front of you, they’re not coming from one place. They’re coming from thousands of places."
So this is an internal system, right? What is the point of telling the world if it's not available for anyone to look at or use? Perhaps a recruiting exercise?