Common data model mistakes made by startups

175 points by ReginaDeiPirati almost 4 years ago

21 comments

rm999 almost 4 years ago

> Soft deletes

This section is totally wrong IMO. What is the alternative? "Hard" deleting records from a table is usually a bad idea (unless it is required for legal reasons), especially if that table's primary key is a foreign key in another table: imagine deleting a user and then having no idea who made an order. Setting a deleted/inactive flag is by far the lesser of two evils.

> when multiplied across all the analytics queries that you’ll run, this exclusion quickly starts to become a serious drag

I disagree; modern analytics databases filter cheaply and easily. I have scaled data orgs 10-50x and never seen this become an issue. And if this really is an issue, you can remove these records in a transform layer before they reach your analytics team, e.g. in your data warehouse.

> soft deletes introduce yet another place where different users can make different assumptions

Again, you can transform these records out.

pyrophane almost 4 years ago

I think the biggest mistake some startups make wrt their data model is not really thinking about it at all. The data model winds up being the byproduct of all the features they've implemented and the framework and the libraries they've used, rather than something that was deliberately designed.

rectang almost 4 years ago

Metabase provides business analytics, and this list of "common mistakes" is weighted towards "choices which get in the way of business analytics".

For example:

> 1. Polluting your database with test or fake data

> [...] By polluting your database with test data, you’ve introduced a tax on all analytics (and internal tool building) at your company.

handrous almost 4 years ago

> 5. The “right database for the job” syndrome

I once saw something a little similar to this, except with one flavor of DB rather than several. A company you've likely heard of went hard for a certain Java graph database product, due to a combination of an internal advocate who seemed determined to be The GraphDB Guy and an engineering manager who was weirdly susceptible to marketing material. The reasoning: some of their data could be represented as graphs, so clearly a graph database was a good idea.

However: the data for most of their products was tiny, rarely written, not even read that much really, even less commonly written concurrently, and was naturally sharded (with hard boundaries) among clients. Their use of that graph database product was plainly contributing to bugginess, operational pain, mediocre performance (it was reasonably fast... as long as you didn't want to both traverse a graph and fetch data related to that graph, at which point it was laughably slow) and low development velocity on multiple projects.

Meanwhile, the best DB to deliver the features they wanted quickly, and with some nice built-in "free" features for them (the ability to control access via existing file sharing tools they had, for instance), was probably... SQLite.

konha almost 4 years ago

> On the flip side, soft deletes require every single read query to exclude deleted records.

You can use partial indexes to only index non-deleted rows. If you are worried about having to remember to exclude deleted rows from queries: use a view to abstract away the implementation detail from your analytics queries.
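
A minimal sketch of the partial index plus view idea, using Python's built-in sqlite3 module as a stand-in for whatever database you actually run. The table, column, index, and view names (orders, deleted_at, orders_live_customer, live_orders) are made up for illustration, and the exact syntax may differ on other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id         INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        deleted_at TEXT            -- NULL means the row is live
    );

    -- Partial index: only live (non-deleted) rows are indexed.
    CREATE INDEX orders_live_customer
        ON orders (customer)
        WHERE deleted_at IS NULL;

    -- View that hides the soft-delete detail from analytics queries.
    CREATE VIEW live_orders AS
        SELECT id, customer FROM orders WHERE deleted_at IS NULL;

    INSERT INTO orders (customer, deleted_at) VALUES
        ('alice', NULL),
        ('bob',   '2021-05-22T10:00:00Z'),
        ('carol', NULL);
""")

# Analysts query the view and never need to remember the filter.
live_customers = [row[0] for row in conn.execute(
    "SELECT customer FROM live_orders ORDER BY customer")]
print(live_customers)
```

Queries against live_orders never see the soft-deleted row, while the base table still keeps it for recovery or auditing.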

ridaj almost 4 years ago

I would personally add:

- Having informal metric and dimension definitions: you throw together something quick and dirty and then realize there's something semantically broken or uneven about your data definitions. For example, your Android and iOS apps report "countries" differently, or they have meaningfully different notions of "active users".

- Not anticipating backfill/restatement needs. Bugs in logging and analytics stacks happen as much as anywhere else, so it's important to plan for backfills. Without a plan, backfills can be major fire drills or impossible.

- Being over-attentive to ratio metrics (CTR, conversion rates), which are typically difficult to diagnose (step 1: figure out whether the numerator or the denominator is the problem). Ratio metrics can be useful to rank N alternatives (e.g. campaign keywords), but absolute metrics are usually more useful for overall day-to-day monitoring.

- Overlooking the usefulness of very simple, basic alerting. It's common for bugs to cause a metric to go to zero, to be double counted, or to not be updated with recent data, but oftentimes even these highly obvious problems don't get detected until manual inspection.
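
The last point can be sketched in a few lines of Python. This is only an illustration of the three failure modes named above (zero, double counted, stale); the function name check_metric, the 1.9x spike threshold, and the 24-hour freshness window are all assumptions, not anything from the article.

```python
from datetime import datetime, timedelta, timezone

def check_metric(name, today, yesterday, last_updated, now,
                 max_age=timedelta(hours=24), spike_ratio=1.9):
    """Return a list of human-readable alerts for one daily metric."""
    alerts = []
    # Metric went to zero although it had traffic the day before.
    if today == 0 and yesterday > 0:
        alerts.append(f"{name}: dropped to zero")
    # Metric roughly doubled overnight, a common sign of double counting.
    if yesterday > 0 and today >= spike_ratio * yesterday:
        alerts.append(f"{name}: roughly doubled, possible double counting")
    # Metric has not been refreshed within the allowed window.
    if now - last_updated > max_age:
        alerts.append(f"{name}: no fresh data since {last_updated:%Y-%m-%d %H:%M}")
    return alerts

now = datetime(2021, 5, 23, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(hours=1)
stale = now - timedelta(days=3)

zero_alerts = check_metric("signups", 0, 120, fresh, now)
double_alerts = check_metric("orders", 410, 200, fresh, now)
stale_alerts = check_metric("pageviews", 1000, 990, stale, now)
ok_alerts = check_metric("logins", 105, 100, fresh, now)
```

Even checks this crude would catch the "highly obvious" problems before someone stumbles on them in a dashboard.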

brylie almost 4 years ago

If your company has a subscription business model, keep a history of users' subscriptions. They change over time, and it is likely you will need to measure the popularity and profitability of product offerings over time. Please don't force your analytics team to rely on event logs to reconstruct a subscription history.
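
One possible shape for such a history, sketched with Python's sqlite3 module: each row records a plan together with the period during which it was valid. The table name subscription_history, its columns, and the helper functions are hypothetical, and a real schema would likely carry prices, billing periods, and so on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE subscription_history (
        user_id    INTEGER NOT NULL,
        plan       TEXT    NOT NULL,
        valid_from TEXT    NOT NULL,   -- ISO date, inclusive
        valid_to   TEXT                -- ISO date, exclusive; NULL = current
    )
""")

def change_plan(user_id, plan, on_date):
    """Close the user's current row (if any) and open a new one."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE subscription_history SET valid_to = ? "
            "WHERE user_id = ? AND valid_to IS NULL",
            (on_date, user_id))
        conn.execute(
            "INSERT INTO subscription_history (user_id, plan, valid_from) "
            "VALUES (?, ?, ?)",
            (user_id, plan, on_date))

def plan_on(user_id, on_date):
    """Return the plan the user held on a given date, or None."""
    row = conn.execute(
        "SELECT plan FROM subscription_history "
        "WHERE user_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (user_id, on_date, on_date)).fetchone()
    return row[0] if row else None

change_plan(1, "basic", "2021-01-01")
change_plan(1, "pro", "2021-03-15")
```

With this in place, "which plan was this user on in February?" is a single lookup such as plan_on(1, "2021-02-01") rather than a replay of event logs.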

FriedrichN almost 4 years ago

I have seen so many people argue against soft deletes over the years. But I have also had so many instances where users 'accidentally' deleted a bunch of items and then called support to ask if there were any backups. And then I have to reconstruct the data from yesterday's backup plus today's changes. A soft delete will take care of this.

And no amount of "are you really really really sure you want to delete this?" confirmations is going to fix this. You could require the whole Spongebob Squarepants ravioli ravioli give me the formuoli song and dance and people will still delete hundreds or thousands of records by accident.

ineedasername almost 4 years ago

> Polluting your database with test or fake data

Maybe I've been spoiled, but isn't it common to have dev, test, and prod instances? Possibly multiples of the first two?

dugmartin almost 4 years ago

I would add to their semi-structured data fields section a suggestion to add a version or type key. Otherwise the code consuming those fields may grow over time into a bunch of conditionals to figure out what is in the JSON.
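
A small sketch of what a version key buys you, in Python. The two document shapes, the "version" key name, and the function names are invented for illustration; the point is that the consumer dispatches on one field instead of guessing from which keys happen to be present.

```python
import json

def parse_v1(doc):
    # Hypothetical v1 shape: a single "name" string.
    first, _, last = doc["name"].partition(" ")
    return {"first_name": first, "last_name": last}

def parse_v2(doc):
    # Hypothetical v2 shape: the name is already split into two keys.
    return {"first_name": doc["first_name"], "last_name": doc["last_name"]}

# One parser per known version; adding a version means adding one entry.
PARSERS = {1: parse_v1, 2: parse_v2}

def load_profile(raw):
    """Parse a JSON profile by dispatching on its explicit version key."""
    doc = json.loads(raw)
    version = doc.get("version")
    try:
        parser = PARSERS[version]
    except KeyError:
        raise ValueError(f"unsupported profile version: {version!r}") from None
    return parser(doc)

old = load_profile('{"version": 1, "name": "Ada Lovelace"}')
new = load_profile('{"version": 2, "first_name": "Ada", "last_name": "Lovelace"}')
```

Documents without a recognized version fail loudly with a ValueError instead of falling through a chain of conditionals.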

worik almost 4 years ago

In my experience I would add: building systems out of "Lego blocks".

It is possible to get all the pieces needed to build a data server for an enterprise pre-built from cloud providers, then plumb them together so they mostly work.

When the heat comes on and people are using it for real and it must scale (even a little), it blows up horribly.

The "Lego bricks" save a lot of time and money, and mean that people with only half a clue can build large, impressive-looking systems, but in the end people like me are picking up the pieces.

nivertech almost 4 years ago

There are advantages to soft deletes in a CRUD architecture, but are there any for CQRS/ES (Event Sourcing)?

I guess if your read model is based on an RDBMS then it makes sense; otherwise it depends on the database system in question (e.g. some NoSQL databases like C* [1] and Riak [2] implement deletes by writing special tombstone values, which is a kind of soft delete at the implementation level, but you can't easily restore the data as you can with an RDBMS).

[1] https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

[2] https://docs.riak.com/riak/kv/latest/using/reference/object-deletion/index.html

jasonhansel almost 4 years ago

> Typically semi-structured data have schemas that are only enforced by convention

Technically, in Postgres you can (kind of) enforce arbitrary schemas for semi-structured data using CHECK constraints. Unfortunately this isn't well-documented and NoSQL DBs often don't support similar mechanisms.
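
To illustrate the general idea without needing a Postgres server, here is a sketch that uses SQLite (through Python's sqlite3 module) instead, so this is a swapped-in engine and not the Postgres syntax the comment refers to. It assumes the bundled SQLite build includes the JSON functions json_valid and json_type, and the events table and its "version" rule are hypothetical. In Postgres the analogous constraint would be written with jsonb functions such as jsonb_typeof.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id      INTEGER PRIMARY KEY,
        -- Accept only valid JSON that carries an integer "version" key.
        -- COALESCE matters: a CHECK that evaluates to NULL would pass.
        payload TEXT NOT NULL CHECK (
            CASE WHEN json_valid(payload)
                 THEN COALESCE(json_type(payload, '$.version'), '') = 'integer'
                 ELSE 0
            END
        )
    )
""")

def try_insert(payload):
    """Return True if the row was accepted, False if the CHECK rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO events (payload) VALUES (?)", (payload,))
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert('{"version": 1, "kind": "signup"}')
missing_key = try_insert('{"kind": "signup"}')
not_json = try_insert('not json at all')
```

The database itself now refuses rows that break the convention, so the schema no longer depends on every writer remembering it.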

jayd16 almost 4 years ago

What's the best way to construct a session?

> The exact definition of what comprises a session typically changes as the app itself changes.

Isn't this an argument for post-hoc reconstruction? You can consistently re-run your analytics. If the definition changes in code, your persisted data becomes inconsistent, no?
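
For what post-hoc reconstruction can look like, here is a rough Python sketch that groups raw event timestamps into sessions by inactivity gap. The function name sessionize and the 30-minute default are assumptions chosen for illustration; the article does not prescribe a definition.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Group event timestamps into sessions split on inactivity gaps."""
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if this event is close enough
        # to the previous one; otherwise start a new session.
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

events = [
    datetime(2021, 5, 22, 9, 0),
    datetime(2021, 5, 22, 9, 10),
    datetime(2021, 5, 22, 9, 50),
    datetime(2021, 5, 22, 10, 5),
]

by_30_min = sessionize(events)
by_60_min = sessionize(events, gap=timedelta(minutes=60))
```

The same raw events yield two sessions under a 30-minute rule and one under a 60-minute rule, which is the trade-off in question: reconstruction can be re-run when the definition changes, at the cost of doing the work at query time.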

giovannibonetti almost 4 years ago

> Queries for business metrics are usually scattered, written by many people, and generally much less controlled. So do what you can to make it easy for your business to get the metrics it needs to make better decisions.

A simple but useful thing is setting the database default time zone to match the one where most of your team is (instead of UTC). This reduces the chance your metrics are wrong because you forgot to set the time zone when extracting the date of a timestamp.
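
The failure mode being described is easy to reproduce in plain Python: the calendar date of a timestamp depends on the zone it is read in. The fixed UTC-7 offset below is a hypothetical team zone used only to keep the example self-contained (it ignores daylight saving time).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical team zone: a fixed UTC-7 offset (no DST handling).
TEAM_TZ = timezone(timedelta(hours=-7))

# An order placed at 02:30 UTC on 22 May.
placed_at = datetime(2021, 5, 22, 2, 30, tzinfo=timezone.utc)

# Extracting the date without converting first uses the UTC calendar day.
utc_day = placed_at.date()

# Converting to the team's zone first lands on the previous day.
team_day = placed_at.astimezone(TEAM_TZ).date()

print(utc_day, team_day)
```

A daily revenue query that forgets the conversion would book this order on a different day than the team expects, whichever default the database uses.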

elchief almost 4 years ago

Some enterprise data model links here: https://dba.stackexchange.com/questions/12991/ready-to-use-database-models-example

Instead of soft deletes, move records to a history table.

I agree w/ the session issue. Had to rebuild sessions before and it is a PITA compared to just recording them at the source.
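
A minimal sketch of the history-table approach, again using Python's sqlite3 module as a stand-in database. The users and users_history tables and the archive_user helper are hypothetical names; the essential part is that the copy and the delete happen in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL
    );
    CREATE TABLE users_history (
        id         INTEGER NOT NULL,
        email      TEXT NOT NULL,
        deleted_at TEXT NOT NULL
    );
    INSERT INTO users (id, email) VALUES
        (1, 'a@example.com'),
        (2, 'b@example.com');
""")

def archive_user(user_id, deleted_at):
    """Copy the row to the history table, then delete it, atomically."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO users_history (id, email, deleted_at) "
            "SELECT id, email, ? FROM users WHERE id = ?",
            (deleted_at, user_id))
        conn.execute("DELETE FROM users WHERE id = ?", (user_id,))

archive_user(2, "2021-05-22T10:00:00Z")

live = conn.execute("SELECT id FROM users").fetchall()
archived = conn.execute("SELECT id, email FROM users_history").fetchall()
```

The live table stays free of deleted rows, so read queries need no extra filter, while the data remains recoverable from users_history.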

jerrysievert almost 4 years ago

the one that is missing for me, that is my personal pet peeve:

an index for every column in the database. then wondering why inserts are slow.

seriously?

eterm almost 4 years ago

A more common thing, I think, is just trying to collect and hoard too much data.

Most of these worries, soft deletes included, disappear if you're not trying to keep every scrap of data you can.

Focus on the core business requirements and competencies, and you likely don't need to store the minutiae of every interaction forever.

cjfd almost 4 years ago

It sounds like quite a few of the problems that are mentioned here can be ameliorated using views.

Pxtl almost 4 years ago

How do you reconcile the first bullet point (polluting data with test data) vs Test In Production being the modern trend? Those sound irreconcilable.

intricatedetail almost 4 years ago

I am happy that on so many projects we rejected the kool aid and just used postgres and redis. Can't remember ever troubleshooting these.