I'm in the process of doing something like this internally, at a smaller scale, and it's interesting to see that many of the concepts I've been experimenting with are formalized here in a similar manner. My "solution" doesn't build on Spark, as I just don't have enough data to necessitate it. I think the big difference is really the project's SQL-first approach, which is probably going to be polarizing: personally, it's a decision I can't abide, but I'm sure a lot of people will love it.