Analytics 101: Choosing the right database is the wrong first step.<p>I was excited when I saw 'choosing the right data model' as one of the rules, but they are talking about the data models the DB uses internally. The data model that matters is how you model the data you have to analyze. I have my biases, but I'd argue that a dimensional model would be a good starting point if we're really at a 101 level, with extensions to the model left for future classes/development.<p>Starting with the end in mind is very important. When I look at this, I think in terms of data culture. Who needs to be able to do what in your organization? What types of questions will you need to answer most often? What types of questions will you need to support in an ad-hoc manner?<p>To many organizations, "analytics" means arithmetic with complex filtering and business logic: traditional BI, essentially. To others, "analytics" means R code monkeys. To others, it may mean specifically visualizations and the presentation layer. There are many interpretations of the word. Regardless of the interpretation, process and culture need to be understood before the technology.<p>For a rough analogy, it's like saying "Software development 101: Choosing the right programming language". Sure, it matters, but knowing what your software needs to support and what its primary use cases are matters more.<p>Ninja edit: Grammar.
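For anyone who hasn't seen a dimensional model, here is a minimal sketch of a star schema using Python's built-in sqlite3. Every table and column name (fact_sales, dim_date, dim_product, revenue_cents) is a made-up example, not something from the article, and a real model would carry many more dimensions and attributes.

```python
import sqlite3

def build_star_schema(conn):
    """Create one fact table and two dimension tables (all names are hypothetical)."""
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,   -- e.g. 20160324
            year INTEGER,
            month INTEGER
        );
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            name TEXT,
            category TEXT
        );
        -- The fact table holds measures plus foreign keys into the dimensions.
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity INTEGER,
            revenue_cents INTEGER
        );
    """)

def load_sample(conn):
    """Insert a handful of invented rows so the query below has something to read."""
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                     [(20160323, 2016, 3), (20160324, 2016, 3)])
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                     [(1, "widget", "hardware"),
                      (2, "gadget", "hardware"),
                      (3, "manual", "books")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                     [(20160323, 1, 2, 2000), (20160323, 3, 1, 500),
                      (20160324, 2, 1, 1500), (20160324, 1, 1, 1000)])

def revenue_by_category(conn):
    """Typical BI question: sum a measure, grouped by a dimension attribute."""
    rows = conn.execute("""
        SELECT p.category, SUM(f.revenue_cents)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
load_sample(conn)
print(revenue_by_category(conn))  # {'books': 500, 'hardware': 4500}
```

The point is that the "arithmetic with filtering" questions become a join from the fact table to whichever dimension you want to slice by, regardless of which database ends up hosting the tables.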
Surprised that RethinkDB[1] isn't mentioned. It has support for replication and sharding, plus a query language well suited for analytics.<p>(I'm not affiliated with them, I just think they get disproportionately little coverage given their interesting product.)<p>[1] <a href="http://rethinkdb.com/" rel="nofollow">http://rethinkdb.com/</a>
Actually, you might want to not choose any database at all, but instead focus on deciding on the data format, such as Parquet (<a href="http://parquet.io" rel="nofollow">http://parquet.io</a>) or Avro (<a href="https://avro.apache.org/" rel="nofollow">https://avro.apache.org/</a>), etc. Many of the tools such as Hive, Impala, Spark, etc. support these formats natively.<p>You will also need to think about the schema, partitioning, compression and other parameters, and those are not trivial decisions.
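To make the partitioning and compression decisions concrete, here is a rough stdlib-only sketch. It uses gzip-compressed JSON lines as a stand-in, not Parquet or Avro, and the key=value directory naming follows the Hive-style partition layout. The field names, the dt partition key, and the part-0.jsonl.gz file name are all assumptions for illustration.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict

def write_partitioned(events, root, key="dt"):
    """Group events by the partition key; write one gzip'd JSON-lines file per partition."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event)
    paths = []
    for value, rows in sorted(groups.items()):
        part_dir = os.path.join(root, f"{key}={value}")  # e.g. dt=2016-03-24
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part-0.jsonl.gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        paths.append(path)
    return paths

def read_partition(root, key, value):
    """Read a single partition without touching the others (partition pruning)."""
    path = os.path.join(root, f"{key}={value}", "part-0.jsonl.gz")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

events = [
    {"dt": "2016-03-23", "user": 1, "action": "view"},
    {"dt": "2016-03-24", "user": 2, "action": "click"},
    {"dt": "2016-03-24", "user": 1, "action": "view"},
]
root = tempfile.mkdtemp()
write_partitioned(events, root)
print(sorted(os.listdir(root)))
print(read_partition(root, "dt", "2016-03-24"))
```

Picking the partition key is the non-trivial part: a query that filters on dt only opens one directory, while a query that filters on user still has to scan everything.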
Surprised <a href="http://druid.io/" rel="nofollow">http://druid.io/</a> wasn't mentioned. This db was made specifically for both real time analytics and batch analytics. It even has a nice front end <a href="http://imply.io/" rel="nofollow">http://imply.io/</a>
NEW: Scalable & Open Source PostgreSQL extension <a href="https://www.citusdata.com" rel="nofollow">https://www.citusdata.com</a> (based on PG9.4 / PG9.5)<p>Github: <a href="https://github.com/citusdata/citus" rel="nofollow">https://github.com/citusdata/citus</a><p>HN: <a href="https://news.ycombinator.com/item?id=11353322" rel="nofollow">https://news.ycombinator.com/item?id=11353322</a> "Citus Unforks from PostgreSQL, Goes Open Source (citusdata.com)" (24th March, 2016)<p><i>"What is Citus?<p>- Open-source PostgreSQL extension (not a fork)<p>- Scalable across multiple hosts through sharding and replication<p>- Distributed engine for query parallelization<p>- Highly available in the face of host failures"</i><p><i>"Citus provides users real-time responsiveness over large datasets, most commonly seen in rapidly growing event systems or with time series data. Common uses include powering real-time analytic dashboards, exploratory queries on events as they happen, session analytics, and large data set archival and reporting."</i> <a href="https://www.citusdata.com/blog/17-ozgun-erdogan/403-citus-unforks-postgresql-goes-open-source" rel="nofollow">https://www.citusdata.com/blog/17-ozgun-erdogan/403-citus-un...</a>
This is by no means a comprehensive list of databases, and that's not the intent of the article. The real intent is a simple read for the many companies still running solely on 'general purpose databases', showing where newer database technologies can fit in based on their data needs. Upvote.
I've always wanted a super-compact database for storing integers on smaller setups where I don't have the resources to run a dedicated logging server.<p>I can represent almost everything as an int: time, CPU usage, line number, etc. Even a single byte is enough for most things, like a server number or which custom error was thrown.
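As a rough sketch of the integers-only idea, Python's struct module can pack each event into a fixed-width binary record. The field layout here (uint32 timestamp, uint8 server id, uint8 CPU percent, uint16 line number, uint8 error code) is an assumption for illustration, and this is just an append-only byte buffer, not a database.

```python
import struct

# Hypothetical fixed-width record, little-endian with no padding:
#   I = uint32 unix timestamp, B = uint8 server id, B = uint8 cpu percent,
#   H = uint16 line number,    B = uint8 error code
RECORD = struct.Struct("<IBBHB")

def pack_event(ts, server, cpu_pct, line_no, err):
    """Encode one event as RECORD.size bytes."""
    return RECORD.pack(ts, server, cpu_pct, line_no, err)

def unpack_events(buf):
    """Decode a buffer of back-to-back records into a list of tuples."""
    return [RECORD.unpack_from(buf, offset)
            for offset in range(0, len(buf), RECORD.size)]

log = bytearray()
log += pack_event(1458777600, 3, 42, 1187, 7)
log += pack_event(1458777601, 3, 57, 1190, 0)

print(RECORD.size, len(log))  # 9 18
print(unpack_events(log))
```

At 9 bytes per event, a million events fit in about 9 MB before any compression, and fixed-width records mean you can seek straight to the Nth event.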
Please, please don't set your text color to #888 or #999 on a white background.<p>I don't want to have to edit your CSS just so I can read the text.