Well, that was very uninformative.

> To meet our performance expectations, Kafka must work from memory, and we don’t have much memory to give it... ...Even the smallest customer required two or three Kafka instances

A) What performance expectations? Timeliness?

B) What's "not much" memory?

C) And why don't you have that much?

D) When you say instance, do you mean "broker", or actual clusters?
Why did the smallest customer need 2 to 3 of them?
I get not wanting to add yet another system, to keep operational complexity down, but it seems more economical to use a system like Flink to do a time-windowed join and emit single records to be written to a persistence store. The Flink time window can be large enough to cover the disparity between ingest time and event time without much RAM consumption, by using a RocksDB state backend on the operator (roughly the sketch below). Let me know if I'm missing something, every use case is different :)
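Roughly what I have in mind (an untested sketch; the Reading/Metadata/Enriched types, the sources, and the 10-minute window are all made up for illustration):

    // Flink 1.13+: keep join state in RocksDB so it spills to disk
    // instead of holding the whole window in RAM
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStateBackend(new EmbeddedRocksDBStateBackend());

    DataStream<Reading> readings = ...;   // sensor ingest stream (stand-in)
    DataStream<Metadata> metadata = ...;  // stream to join against (stand-in)

    readings.join(metadata)
        .where(r -> r.sensorId)
        .equalTo(m -> m.sensorId)
        // window wide enough to absorb the ingest-vs-event-time disparity
        .window(TumblingEventTimeWindows.of(Time.minutes(10)))
        .apply((r, m) -> new Enriched(r, m))
        .addSink(persistenceSink);        // single joined record out to the store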
From earlier in the article:

> Clock skew across different sensors: Sensors might be located across different datacenters, computers, and networks, so their clocks might not be synchronized to the millisecond.

And later on, in their final solution:

> Implementation 4 cons: Producers and consumers must have synchronized clocks (up to a certain resolution)

How do they reconcile this skew in their final solution?
Does anyone on here have some real-world experience with Scylla?

We currently make heavy use of Dynamo and are interested in something cheaper/faster. The marketing material is pretty compelling, but I'm unsure of how hard Scylla is to operate at scale.
If a general system becomes good enough, you see it displace specialized systems.
In this case the Kafka paradigm can be replaced because there is such a performant NoSQL DB.

It's kind of like how standalone cameras became less and less desirable as phone cameras got better. A standalone camera could do better quality, but that matters less once both options are really good. There is some 'good enough' point where you hit vastly diminishing returns and simplifying down to just the phone becomes worthwhile.

Databases (certainly Scylla) may be hitting a point where specializing, actively optimizing, etc. are less desirable than just reusing one good system.
I'm not seeing the "stream processing" piece here.

Looks like they went from polling an RDBMS to some triggered querying of Scylla, and then on to polling Scylla.

i.e. they went from polling an RDBMS to polling Scylla. They didn't replace Kafka with anything, so now their implementation isn't reactive.

This is effectively no different than implementing a message queue in a database, with all the negatives that brings.

They are sharding per consumer to prevent multiple consumption due to lack of locks. What if a consumer goes down? How does it manage its own state? All things managed by Kafka (or pretty much any MQ) out of the box, and now they have to implement ALL of that themselves, none of which is mentioned in the article.
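For contrast, this is roughly all a bare Kafka consumer needs to get group membership, partition assignment, rebalancing when a consumer dies, and offset tracking for free (broker address and topic name are made up; error handling omitted):

    Properties props = new Properties();
    props.put("bootstrap.servers", "broker:9092");     // made-up address
    props.put("group.id", "processors");               // the group handles failover/rebalance
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(List.of("events"));          // hypothetical topic
        while (true) {
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1)))
                process(rec);                           // your processing logic (stand-in)
            consumer.commitSync();                      // Kafka stores the offsets, not you
        }
    }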
This reads like sales copy. It's freemium FOSS-washed crippleware with a radioactive license (AGPL). Hard pass.

I'll stick to FOSS solutions that don't require licenses to unlock closed-source components and can be patched by a community and/or yourself.

Edit: There are other commercial-ish OSS NoSQL solutions for large-scale apps that are less proprietary, with better licenses, like Couchbase (not Cassandra CQL).
ScyllaDB and its related parts like Seastar always struck me as real performance-oriented programming, though it leaned on language tech (C++14 early on) that was painful to work with. I wonder if a nicer approach is possible nowadays.
Is there any more recent technical review of ScyllaDB than this?

https://jepsen.io/analyses/scylla-4.2-rc3
So they're using Scylla to manually do what Kafka does: basically, processors polling for new records in a shard and updating their watermark once they're done processing (roughly the loop sketched below). I'm surprised that this is faster than just using Kafka alone, though one of the reasons they wanted to avoid Kafka was the deployment complexity and memory usage of Kafka clusters.
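My guess at what that loop looks like, using the DataStax Java driver (the schema, the loadWatermark/saveWatermark/process helpers, and the poll interval are all assumptions on my part, not from the article):

    // One processor owns one shard and advances its own watermark:
    // the bookkeeping Kafka's consumer groups would otherwise do.
    CqlSession session = CqlSession.builder().build();
    long watermark = loadWatermark(session, shardId);   // hypothetical helper

    while (true) {
        ResultSet rows = session.execute(
            "SELECT * FROM events WHERE shard = ? AND ts > ?",  // assumes ts is a clustering column
            shardId, watermark);
        for (Row row : rows) {
            process(row);                               // hypothetical processing step
            watermark = row.getLong("ts");
        }
        saveWatermark(session, shardId, watermark);     // processor persists its own offset
        Thread.sleep(200);                              // poll interval: a pure guess
    }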
> Like the first solution, normalized data is stored in a database – but in this implementation, it’s a NoSQL database instead of a relational database.

What does data normalization mean in a NoSQL context? I think most normal forms only make sense where we have tables, rows, and relational algebra.