While cool and interesting, this strikes me as exactly the type of overengineering tech-heavy startups keep being told to stop doing. I get the argument that it'll "give you a platform to build on" and help you "move fast later," but how much did you spend on engineering a storage library that doesn't solve a critical business need? $10k? $20k? How many people spent hours and hours building another Clojure-based abstraction layer over key-value stores?

I also don't buy the argument that it'll help you be really "scalable". Once your datasets no longer fit in memory, it's all custom. Hadoop is a great solution if you have infinite money. Data locality matters. All those different services may expose a consistent-enough interface that you can build a common abstraction over them, but their latency and reliability properties, failure modes, consistency, etc. are not homogeneous, and at scale you can't pretend that they are. If your data fits in memory today (and that means 144-192 GB per machine), why are you worrying about this? You'll need to rewrite a huge part of your infrastructure to scale from 100k users to 10M users anyway.

TL;DR: solve the problem first and abstract out the framework later. Also, "scalability" is incredibly expensive (and unnecessary) to engineer for a priori.
This is your second post, and you've already become one of my favorites. I have an NLP question for you. I noticed you're building a nested map like:

    {"Prismatic" {"runs" 1}, "runs" {"on" 1}, "on" {"coffee" 1}}

Why do you keep a count of each word that follows rather than assuming it's always one? Why not just use the bigram itself as the key and an int as the value to keep the count?
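For what it's worth, here's a minimal sketch of the two representations in plain Clojure (the function names are mine, not from the post), assuming the counts come from sliding a two-word window over the token stream:

    ;; hypothetical helpers, not from the original post
    (def words ["Prismatic" "runs" "on" "coffee"])

    ;; nested: word -> {next-word -> count}
    (defn nested-counts [ws]
      (reduce (fn [m [a b]] (update-in m [a b] (fnil inc 0)))
              {}
              (partition 2 1 ws)))

    ;; flat: [word next-word] pair -> count
    (defn flat-counts [ws]
      (frequencies (partition 2 1 ws)))

    (nested-counts words) ;=> {"Prismatic" {"runs" 1}, "runs" {"on" 1}, "on" {"coffee" 1}}
    (flat-counts words)   ;=> {("Prismatic" "runs") 1, ("runs" "on") 1, ("on" "coffee") 1}

One reason to prefer the nested form: getting the distribution of words that follow "Prismatic" is a single lookup, which is handy for generation or completion, whereas with flat bigram keys you'd have to scan or keep a second index.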