Analytics 101: Choosing the right database

96 points by turoczy about 9 years ago

10 comments

greggyb about 9 years ago
Analytics 101: Choosing the right database is the wrong first step.

I was excited when I saw 'choosing the right data model' as one of the rules, but they are talking about the data models the DB uses internally. The important data model is how you model the data you have to analyze. I have my biases, but I'd argue that a dimensional model would be a good starting point if we're really at a 101 level, with extensions to the model left for future classes/development.

Starting with the end in mind is very important. When I look at this, I think in terms of data culture. Who needs to be able to do what in your organization? What types of questions will you need to answer most often? What types of questions will you need to support in an ad-hoc manner?

To many organizations, "analytics" means arithmetic with complex filtering and business logic: essentially traditional BI. To others, "analytics" means R code monkeys. To others it may mean specifically visualizations and the presentation layer. There are many interpretations of the word. Whichever interpretation applies, process and culture are more important to understand than the technology, and should be understood first.

For a rough analogy, it's like saying "Software development 101: Choosing the right programming language". Sure, it matters, but knowing what your software needs to support and what the primary use cases are is more important.

Ninja edit: Grammar.
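To make the dimensional model mentioned above concrete, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. The table names, columns, and sample rows are all hypothetical and chosen only for illustration; they do not come from the article or the comment.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    revenue_cents INTEGER
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (20160301, "2016-03-01", "2016-03"),
    (20160302, "2016-03-02", "2016-03"),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "widget", "hardware"),
    (2, "gadget", "hardware"),
    (3, "manual", "books"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (20160301, 1, 2, 2000),
    (20160301, 3, 1, 500),
    (20160302, 2, 1, 1500),
    (20160302, 1, 1, 1000),
])

def revenue_by_category(connection):
    """Sum the fact measure, grouped by an attribute of a dimension table."""
    rows = connection.execute("""
        SELECT p.category, SUM(f.revenue_cents)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    return dict(rows)

print(revenue_by_category(conn))
```

Most questions then take the same shape: join the fact table to one or more dimensions, filter or group by dimension attributes, and aggregate a measure.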
sandstrom about 9 years ago
Surprised that RethinkDB[1] isn't mentioned. It has support for replication and sharding, plus a query language well suited for analytics.

(I'm not affiliated with them, I just think they get disproportionately little coverage given how interesting their product is.)

[1] http://rethinkdb.com/
gtrubetskoy about 9 years ago
Actually, you might want to not choose any database at all, and instead focus on deciding on the data format, such as Parquet (http://parquet.io) or Avro (https://avro.apache.org/). Many tools, such as Hive, Impala, and Spark, support these formats natively.

You will also need to think about the schema, partitioning, compression and other parameters, and those are not trivial decisions.
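A rough sketch of what the schema, partitioning, and compression decisions look like in practice. This is not Parquet or Avro: to stay within the Python standard library it uses gzip-compressed CSV as a stand-in, laid out in `key=value` partition directories in the style Hive uses. The field names, file names, and sample events are hypothetical.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Hypothetical events; the field names are made up for illustration.
EVENTS = [
    {"day": "2016-03-24", "user": "a", "value": "3"},
    {"day": "2016-03-24", "user": "b", "value": "5"},
    {"day": "2016-03-25", "user": "a", "value": "7"},
]
FIELDS = ["user", "value"]  # schema decision: which columns are stored, in what order

def write_partitioned(events, root):
    """Write one gzip-compressed CSV file per day under root/day=<value>/."""
    by_day = {}
    for event in events:
        by_day.setdefault(event["day"], []).append(event)
    for day, rows in by_day.items():
        part_dir = Path(root) / f"day={day}"  # partitioning decision: split by day
        part_dir.mkdir(parents=True, exist_ok=True)
        # Compression decision: gzip. The partition column lives in the
        # directory name, so it is dropped from the file itself.
        with gzip.open(part_dir / "part-0.csv.gz", "wt", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)

def read_partition(root, day):
    """Read only the file for one day, without touching other partitions."""
    path = Path(root) / f"day={day}" / "part-0.csv.gz"
    with gzip.open(path, "rt", newline="") as fh:
        return list(csv.DictReader(fh))

root = tempfile.mkdtemp()
write_partitioned(EVENTS, root)
rows_for_day = read_partition(root, "2016-03-24")
print(rows_for_day)
```

The partition key determines which queries can skip data entirely, which is one reason the choice is not trivial: a layout partitioned by day helps date-range queries and does nothing for lookups by user.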
pookeh about 9 years ago
Surprised http://druid.io/ wasn't mentioned. This db was made specifically for both real-time analytics and batch analytics. It even has a nice front end: http://imply.io/
pella about 9 years ago
NEW: Scalable & Open Source PostgreSQL extension https://www.citusdata.com (based on PG9.4 / PG9.5)

GitHub: https://github.com/citusdata/citus

HN: https://news.ycombinator.com/item?id=11353322 "Citus Unforks from PostgreSQL, Goes Open Source (citusdata.com)" (24th March, 2016)

"What is Citus?

- Open-source PostgreSQL extension (not a fork)
- Scalable across multiple hosts through sharding and replication
- Distributed engine for query parallelization
- Highly available in the face of host failures"

"Citus provides users real-time responsiveness over large datasets, most commonly seen in rapidly growing event systems or with time series data. Common uses include powering real-time analytic dashboards, exploratory queries on events as they happen, session analytics, and large data set archival and reporting." https://www.citusdata.com/blog/17-ozgun-erdogan/403-citus-unforks-postgresql-goes-open-source
hbcondo714 about 9 years ago
This is by no means a comprehensive list of databases, and that's not the intent of the article. The real intent is to offer a simple read for the many companies still running solely on 'general purpose databases', showing where newer database technologies can fit in based on their data needs. Upvote.
cdeshpande about 9 years ago
What about ElasticSearch? Even though it's a search engine, it's growing in popularity as a schemaless JSON data store.
Xeoncross about 9 years ago
I've always wanted a super-compact database for storing integers on smaller setups where I don't have the resources to run a dedicated logging server.

I can represent almost everything as an int: time, CPU usage, line number, etc. Even a single byte is enough for most things, like the server number or which custom error was thrown.
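As a sketch of the idea in the comment above, fixed-width integer records can be packed with Python's built-in struct module. The record layout here (which fields, and how many bytes each gets) is an assumption made for illustration, not something the commenter specified.

```python
import struct

# Hypothetical record layout (an assumption, not from the comment):
#   I = uint32 Unix timestamp, B = uint8 CPU usage percent,
#   H = uint16 line number,    B = uint8 server number, B = uint8 error code
# "<" selects little-endian with no padding, so each record is 9 bytes.
RECORD = struct.Struct("<IBHBB")

def pack_record(timestamp, cpu_percent, line_number, server, error_code):
    """Encode one log entry as a fixed-width binary record."""
    return RECORD.pack(timestamp, cpu_percent, line_number, server, error_code)

def unpack_records(blob):
    """Decode a byte string of concatenated records into tuples."""
    return [RECORD.unpack_from(blob, offset)
            for offset in range(0, len(blob), RECORD.size)]

# Two made-up log entries, concatenated as they might be appended to a file.
log = b"".join([
    pack_record(1459000000, 37, 120, 2, 0),
    pack_record(1459000060, 91, 455, 2, 17),
])

print(RECORD.size, len(log))
print(unpack_records(log))
```

Because every record has the same size, the n-th entry sits at byte offset `n * RECORD.size`, so a plain append-only file can be read or seeked into without an index.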
fweespee_ch about 9 years ago
Please, please don't set your text color to #888 or #999 on a white background.

I don't want to have to edit your CSS just so I can read the text.
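For reference, the complaint above can be checked numerically with the WCAG 2 contrast-ratio formula. The sketch below is an illustration with hypothetical helper names; it is not part of the comment or the article. WCAG level AA asks for at least 4.5:1 for normal-size text.

```python
def _linear(channel):
    """Convert an 8-bit sRGB channel to its linear value (WCAG 2 definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color):
    """Relative luminance of a #rgb or #rrggbb color."""
    value = hex_color.lstrip("#")
    if len(value) == 3:  # expand shorthand such as "888" to "888888"
        value = "".join(ch * 2 for ch in value)
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)

def contrast_ratio(foreground, background):
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)

for color in ("#888", "#999"):
    ratio = contrast_ratio(color, "#fff")
    print(color, round(ratio, 2), "passes AA" if ratio >= 4.5 else "fails AA")
```

Both grays come out below the 4.5:1 threshold against white.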
paoloiam about 9 years ago
Turoczy, thank you for sharing! Great insights.