The trouble with Cassandra as an object storage metadata database

84 points by jasim about 4 years ago

19 comments

kerblang about 4 years ago
The most basic and trivial way that you're going to get burned by Cassandra is that you have to divide your primary key into two parts: partition key columns and clustering key columns. Partition keys must be in every "where" clause; only clustering keys are optional.

Okay, I'll just not bother with partition keys, right? Except partition keys determine which "partition" your data goes in. So if your only partition key column is "day_of_week", then you have 7 partitions. New problem: your partitions need to be < 300 MB or Cassandra falls over dead.

In fact you'll soon realize that over the lifetime of your table, keeping those partitions under control may end up forcing you to use various date tricks, like putting year/month/day in as "artificial" partition keys.

Of course if you put *everything* in the partition keys instead of clustering keys, let me note again that you have to put each partition key column in every where clause, in which case you may have trouble querying for batches of data.

Furthermore, when you do put clustering columns in your where clause, they have to be in the order declared; so if your clustering columns are a, b, and c, then your where clause can use (a), (a,b), or (a,b,c); but if it has b, it has to have a, and if it has c, it has to have a and b. This is because storage is *hierarchical*. (Same rules for "order by", btw. And no, there's no "group by".)

This is when you start realizing: oh, you mean it's really not "SQL without joins". No, not even close.
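In case it helps anyone, here is a minimal sketch of those rules using the DataStax Python driver; the cluster address, keyspace, table, and column names are all made up for illustration:

```python
from datetime import datetime, timezone
from cassandra.cluster import Cluster

# Assumes a local node and an existing "demo" keyspace.
session = Cluster(["127.0.0.1"]).connect("demo")

# (sensor_id, day) is the partition key ("day" is the artificial date trick
# mentioned above); (ts, reading_id) are the clustering columns.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id  text,
        day        text,
        ts         timestamp,
        reading_id uuid,
        value      double,
        PRIMARY KEY ((sensor_id, day), ts, reading_id)
    )
""")

# OK: every partition key column is restricted, and clustering columns are
# used left-to-right (a range on ts is allowed because it is the last one touched).
session.execute(
    "SELECT * FROM readings WHERE sensor_id = %s AND day = %s AND ts >= %s",
    ("s1", "2021-02-24", datetime(2021, 2, 24, tzinfo=timezone.utc)),
)

# Rejected by Cassandra: 'day' (a partition key column) is missing, and
# 'reading_id' is restricted without restricting 'ts' before it.
#   SELECT * FROM readings WHERE sensor_id = 's1' AND reading_id = <uuid>
```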
SMFloris about 4 years ago
A while back we explored the use of Cassandra. We wanted to keep some event-related data there and have it be relatively fast read-wise so we could do all sorts of reporting based on it. So we wrote a lot and wanted to read fast. It seemed like a perfect store for our timestamped events, especially since we didn't even want to use deletes and it has built-in record deduplication via its primary key. Turns out, it is not that perfect.

Other than what the article described, I can also add:

1. It has a steep learning curve, but you do get to see the advantages while you learn it. But then, everything comes crumbling down.

2. The setup is a pain locally. Then it is a pain to set up in prod and manage. The tooling itself feels very unfinished and basic.

3. No querying outside the primary index on AWS Keyspaces if you want it managed. Also, any managed variants are EXPENSIVE. I mean, every database is fast if you only query by the primary index, so why pay extra?

It is just not worth it. For example, we wound up using MongoDB and it turned out to be fast, scalable, and had mature tooling; we can keep tons of event-related metadata in it, it is easy to manage, and it doesn't cost a fortune.
omginternets about 4 years ago
This is a restatement of the Cassandra docs paired with the usual misunderstanding of CAP. An AP system is *not* a choice of "I'll have availability *and* partition tolerance, please". It's the choice of availability *given* a partition. The whole point of CAP is that *when* partitions occur, there is a forced choice -- this is why it's incorrect to ask for a CA system.

The MinIO team are manifestly excellent engineers, but insightless posts that contain subtle misunderstandings of CAP do nothing to showcase that competence.
eternalban about 4 years ago
So these guys have a storage system on top of Amazon S3 with distributed (timestamp-based) RW mutex locking that uses NTP (github.com/beevik/ntp). That seems to be about it.

https://github.com/minio/minio/blob/master/pkg/bucket/object/lock/lock.go

https://github.com/minio/minio/blob/master/pkg/dsync/drwmutex.go

And this is their test:

https://github.com/minio/minio/blob/master/pkg/dsync/dsync_test.go

I just browsed quickly, but it is littered with Amazon Simple Storage Service hardcoded bits like this:

https://github.com/minio/minio/blob/master/pkg/bucket/object/lock/lock.go#L96

There is not a single document that I can find that discusses the MinIO architecture. I guess "MinIO is a simple wrapper around S3 with a homegrown distributed state tech using NTP and it is 'fast!'" does not make for a sexy doc.

The column "trouble with OSS distributed DBs without Jepsen tests" has probably already been written. There should also be one about "competitor's mature product bashing blogs are HN clickbait".
snapetom about 4 years ago
Meh. About 80% of that article discusses known limitations of Cassandra, which aren't specific to the use case of object store metadata storage. In the little that it actually does specifically say about that use case, if your object store reflects the Cassandra limitations (flat, infrequent mutability), I don't see why Cassandra would be a bad choice.
segmondy about 4 years ago
If you really want to see and get hints on what you can do with Cassandra, head over to the Netflix tech blogs. They use Cassandra extensively.
KaiserPro about 4 years ago
I would have hoped it's obvious.

Your metadata database needs to be the fastest and most reliable store out of everything. It can't be eventually consistent without partitioning your datastore. Even then you'll end up partitioning your data neatly into the same failure zone.

Cassandra has basically one use case: high-volume writes, with a few batch reads.

Cassandra is not really optimised for high reads.

Most of the time postgres will do fine.
chrislusf about 4 years ago
I am working on SeaweedFS, which supports the S3 API for object storage and can also use Cassandra as the metadata DB. Cassandra has been performing well for most SeaweedFS users.

The article listed many known Cassandra characteristics and cited them as limitations. However, it all depends on use cases. There is no file system that works for all cases, and not all of them need ACID, CA vs CP, etc. The rest of the points are not convincing either; they are related to how to design the data structure better.

Actually, SeaweedFS can use many other databases/KV stores as the metadata DB. The list includes Redis, Cassandra, HBase, MySQL, Postgres, Etcd, ElasticSearch, etc. https://github.com/chrislusf/seaweedfs/wiki/Filer-Stores

I did find one drawback of Cassandra as the metadata store, though. In one use case, the customer uploaded a lot of zip files to one folder, /tmp, unzipped them, and then moved them to a final folder. The rate was about 3000 files per second created and then deleted. Being an LSM structure, the tombstones quickly piled up and the directory listing was slow.

The solution was to use Redis for that /tmp folder, and still use Cassandra for the rest of the folders. With Redis's structure, the creation and deletion are cheap.

So it all depends on use cases.
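Roughly, the routing idea described above looks like this toy sketch (the names and interfaces are made up for illustration, not SeaweedFS's actual filer API):

```python
# Toy sketch of the per-path split described above (not SeaweedFS's real filer
# API): metadata for churn-heavy paths goes to Redis, everything else to Cassandra.

CHURN_PREFIXES = ("/tmp/",)

def metadata_store_for(path, redis_store, cassandra_store):
    """Pick the metadata backend based on the directory prefix."""
    if path.startswith(CHURN_PREFIXES):
        # Cheap create/delete; no LSM tombstones to scan past when listing.
        return redis_store
    # Long-lived folders stay on the LSM-based store.
    return cassandra_store
```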
ddorian43 about 4 years ago
Storing metadata together with data will just make it harder and slower to query metadata (since it will reside on HDD in most cases).

You may think "it will be cached in RAM, because it's small", yes, but then you'll end up querying many nodes just for metadata queries.

Yes, it's nicer to manage only one system, but in big scenarios it's probably better to separate them.

You can have 50+ TB of NVMe in one server, so your metadata layer probably doesn't need to scale horizontally.

Imagine if you lose some objects (because you lost some replicas). You won't even know WHICH objects you lost, because the metadata is gone together with the data.

Keeping them separate, you can use 5 replicas for metadata to be even safer, compared to the usual 3 replicas.
jeff_vader about 4 years ago
One of my favourite tech overkills I've seen in my career is a Cassandra database used to store a couple million records. Yup. Needless to say, it was later converted to PostgreSQL. The guy who did it is still advertising the fact on his online CV.
KingOfCoders about 4 years ago
The one thing about Cassandra: most probably you don't need it, because you do not have the write performance needs and will not have them in the next five years. Scaling early is the death of many startups [1]. Postgres with a time-series extension will be sufficient for most needs.

[1] https://www.duetpartners.com/why-is-premature-scaling-still-the-biggest-startup-killer/
francoisdevlin about 4 years ago
A tangent - I love using MinIO in dev & test as an S3 simulator. Being able to throw away a bucket and start from scratch, and having everything self-contained in my docker-compose command, is a real blessing (a minimal client sketch is below).

Has anyone ever used min.io for production stuff? What are the pros & cons over vanilla S3?
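For the dev & test use mentioned above, a minimal sketch of pointing ordinary S3 client code at the local MinIO container; the endpoint, credentials, and bucket name are placeholders for whatever your docker-compose file sets up:

```python
import boto3

# Point a regular S3 client at the local MinIO container instead of AWS.
# Endpoint and credentials are placeholders for your docker-compose config.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    region_name="us-east-1",
)

BUCKET = "test-bucket"

# Throw-away bucket: recreate it from scratch for every test run.
s3.create_bucket(Bucket=BUCKET)
s3.put_object(Bucket=BUCKET, Key="fixtures/hello.txt", Body=b"hello")
print(s3.get_object(Bucket=BUCKET, Key="fixtures/hello.txt")["Body"].read())
```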
AtlasBarfed about 4 years ago
"I wanted to store something that could scale, but wanted to store and access my data in a way that didn't scale. Also, I don't understand CAP, distributed transactions, or distributed systems. I had a bad time."

Somewhere else in the comments: "Yeah, we went with MongoDB."
anonymousDan about 4 years ago
I don't quite get what they are referring to when they talk about 'metadata' here. Are they talking mostly about something internal to the database, something to do with the schema, or some kind of additional data used to enrich a particular object?
pid_0 about 4 years ago
From an ops perspective, managing Cassandra is a bitch. Just use a managed service unless you have the money to hire a dedicated Cassandra expert. I’m so glad I’m done with ops
u678u about 4 years ago
It's frustrating when it doesn't recommend something better. Every DB has tradeoffs; if you choose something over Cassandra, you have to give up something else.
icegreentea2 about 4 years ago
What does "Bottom line. Write your metadata atomically with your object. Never separate them." mean from the perspective of picking a system?
gryn about 4 years ago
Can anyone explain to me why someone would want to use Cassandra when not handling internet-scale stuff?

The team I am in uses it, but after many times asking why it was chosen, since it seems a poor fit for our use cases compared to a relational DB, the only justification I was given is that the cluster is easier for our ops guys to maintain.
toolslive about 4 years ago
Yes. The problem with "eventual consistency" is that "eventually" never happens about 1 in a million times, and with millions of storage objects to manage you can't have that. So what's the alternative as a metadata store for object storage? A consensus algorithm (Paxos, Raft, Mencius, ...) on top of local key-value stores.
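As a toy sketch of that shape (all names made up for illustration): the consensus layer, not implemented here, decides the order of metadata mutations, and every replica applies the committed log to its own local key-value store, so there is no "eventually".

```python
# Toy sketch only: the consensus layer (Raft/Paxos/Mencius) is assumed, not
# implemented. Each replica applies committed entries, in log order, to its
# own local key-value store, so all replicas converge deterministically.

class LocalKV:
    """Stand-in for a local key-value store (RocksDB, LMDB, ...)."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class MetadataStateMachine:
    """Applies the committed, ordered log of metadata mutations."""

    def __init__(self, kv: LocalKV):
        self.kv = kv
        self.last_applied = 0

    def apply(self, index: int, entry: dict) -> None:
        # Entries arrive in commit order from the consensus layer; applying
        # them in that order keeps every replica's metadata identical.
        if index != self.last_applied + 1:
            raise ValueError("log entries must be applied in order")
        if entry["op"] == "put":
            self.kv.put(entry["key"], entry["value"])
        elif entry["op"] == "delete":
            self.kv.delete(entry["key"])
        self.last_applied = index
```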