Over the past few months I have been experimenting with building a fast, distributed, in-memory, append-only database designed for analytics. The idea is that many NoSQL databases are pretty terrible at ad-hoc querying, while ordinary relational databases support ad-hoc queries well but their performance leaves a lot to be desired on large data sets. Basically an open-source Vertica / KDB.<p>The system will accept data over HTTP (as JSON / CSV) and can be queried in either SQL or an SQL-like language, with full support for joins, sub-queries and aggregations, and output as CSV or JSON over HTTP.<p>The idea is that you can fire off a JSON request to your analytics database whenever something happens (be it a signup, purchase, click, etc.) and have that data captured, ready for use in your dashboards or for ad-hoc querying. The system will also be able to integrate with R for statistical fanciness.<p>If such a system existed, would you use it?<p>I will probably continue working on it regardless of any feedback (because it's fun!), but I'll spend more time on it if it's something people feel they might use.<p>If you would like to develop it too (using a combination of C for data manipulation and Go for everything else), send me an email.
I'm not sure I understand how your design overcomes the issues of either type of database. You'll have unstructured data, like a NoSQL database, by virtue of your insertion mechanism, and there is no mention of how you plan to shrink record size by the orders of magnitude that would be required to keep all data in memory. As a thought exercise, say my average record takes up 1 KB. After only a few million records, with zero overhead for the database structures themselves (not to mention unrelated processes also running on the system), you've already exceeded the amount of memory typically available to run these types of processes.
You didn't include an email.<p>I've been having success with Go for backend work, and easy C integration in a database like this would be useful for what I do. I'm not sure R integration should be a priority, though, since most R users I know are ambivalent about it (whatever, small sample size), and because Julia development continues to improve.