I've been working hard to upskill on the consistency and distributed systems sides of things. General recommendations:<p>- Designing Data Intensive Applications. Great overview of... basically everything, and every chapter has dozens of references. Can't recommend it enough.<p>- Read papers. I've had lots of a-ha moments going to wikipedia and looking up the oldest paper on a topic (wtf was in the water in Massachusetts in the 70s..). Yes they're challenging, no they're not impossible if you have a compsci undergrad equivalent level of knowledge.<p>- Try and build toy systems. I built out some small and trivial implementations of CRDTs here <a href="https://lewiscampbell.tech/sync.html" rel="nofollow noreferrer">https://lewiscampbell.tech/sync.html</a>, mainly by reading the papers (a minimal sketch of the idea follows below). They're subtle but they're not rocket science - mere mortals can do this if they apply themselves!<p>- Follow cool people in the field. Tigerbeetle stands out to me despite sitting at the opposite corner of the consistency/availability space from where I've made my nest. They really are poring over applied dist sys papers and implementing them. I joke that Joran is a dangerous man to listen to because his talks can send you down rabbit-holes and you begin to think maybe he isn't insane for writing his own storage layer..<p>- Did I mention read papers? Seriously, the research of the smartest people on planet earth is on the internet, available for your consumption, for free. Take a moment to reflect on how incredible that is. Anyone anywhere on planet earth can git gud if they apply themselves.
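To make the toy-CRDT point concrete, here is a minimal sketch of a state-based grow-only counter (G-Counter), one of the simplest CRDTs. This is my own illustration of the textbook construction, not code from the page linked above; the type and function names are made up.

```go
// Minimal state-based G-Counter CRDT sketch (illustrative only).
// Each replica only increments its own slot; Merge takes the
// element-wise max, so it is commutative, associative and
// idempotent, and replicas converge regardless of delivery order.
package main

import "fmt"

type GCounter struct {
	counts map[string]uint64 // replica ID -> that replica's local count
}

func NewGCounter() *GCounter {
	return &GCounter{counts: make(map[string]uint64)}
}

// Inc bumps this replica's own slot.
func (g *GCounter) Inc(replica string) {
	g.counts[replica]++
}

// Value sums every replica's slot to get the global count.
func (g *GCounter) Value() uint64 {
	var total uint64
	for _, c := range g.counts {
		total += c
	}
	return total
}

// Merge folds another replica's state into this one (element-wise max).
func (g *GCounter) Merge(other *GCounter) {
	for id, c := range other.counts {
		if c > g.counts[id] {
			g.counts[id] = c
		}
	}
}

func main() {
	a, b := NewGCounter(), NewGCounter()
	a.Inc("a")
	a.Inc("a")
	b.Inc("b")
	a.Merge(b)
	b.Merge(a)
	fmt.Println(a.Value(), b.Value()) // 3 3: both replicas converge
}
```

Because Merge is idempotent, replicas can exchange state in any order, any number of times (including retries and duplicates), and still converge.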
When I read that Google installed their own atomic clocks in each datacenter for Spanner, I knew they were doing some real computer science (and probably general relativity?) work: <a href="https://www.theverge.com/2012/11/26/3692392/google-spanner-atomic-clocks-GPS" rel="nofollow noreferrer">https://www.theverge.com/2012/11/26/3692392/google-spanner-a...</a>
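For context, the reason for the clocks is Spanner's TrueTime API: every timestamp comes with an uncertainty bound, and a transaction waits out that bound before its commit becomes visible, so timestamp order matches real-time order across datacenters. A toy sketch of that commit-wait idea follows; the epsilon constant and function names are my own invention, not Google's API.

```go
// Toy sketch of Spanner-style "commit wait" (not Google's API;
// the uncertainty bound here is a made-up constant).
package main

import (
	"fmt"
	"time"
)

// epsilon is the assumed worst-case clock error across machines.
// Spanner keeps this small (single-digit milliseconds) by putting
// GPS receivers and atomic clocks in each datacenter.
const epsilon = 5 * time.Millisecond

// commit assigns a timestamp, then waits until every node's clock
// is guaranteed to have passed it, so no later transaction can be
// assigned an earlier timestamp.
func commit() time.Time {
	ts := time.Now()
	time.Sleep(epsilon) // "commit wait": sit out the uncertainty window
	return ts
}

func main() {
	t := commit()
	fmt.Println("commit visible at timestamp", t)
}
```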
I’ve built pretty much my entire career around this problem and it still feels evergreen. If you want a meaningful and rich career as a software engineer, distributed storage is a great area to focus on.
I was hoping that the blog post would actually spell out examples of problems. Is it just me or have there been a lot of shorter blog posts on HN lately that are really no more than an introduction section rather than an actual full article?
If you are interested in the performance aspects of databases, I would recommend watching this great talk [0] from Alexey, a ClickHouse developer, where he covers topics like designing systems by first understanding the hardware's capabilities and the problem landscape.<p>[0]: <a href="https://www.youtube.com/watch?v=ZOZQCQEtrz8">https://www.youtube.com/watch?v=ZOZQCQEtrz8</a>
I worked at a specialty database software vendor for almost 4 years, albeit on ML connectors. I recall that some of the hardest challenges were figuring out each cloud vendor's poorly documented and rapidly changing/breaking marketplace launch mechanisms, usually built atop their own Kubernetes flavor (EKS, AKS, GKE, etc.).
There are so many interesting problems to solve. I just wish there were libraries or solutions that solved a lot of them at the least cost, so that I could build on some good foundations.<p>RocksDB is an example of that.<p>I am playing around with SIMD, multithreaded queues and barriers (not on the same problem).<p>I haven't read the DDIA book.<p>I used Michael Nielsen's consistent hashing code for distributing SQL database rows between shards (a rough hash-ring sketch follows below).<p>I have an eventually consistent protocol that is not linearizable.<p>I am currently investigating how to efficiently schedule system events, such as a TCP socket becoming ready for reading (EPOLLIN) or writing (EPOLLOUT), rather than data events.<p>I want super flexible scheduling styles of control flow. I'm looking at barriers right now.<p>I am thinking about how to respond to events with low latency and across threads.<p>I'm playing with some coroutines in assembly by Marce Coll and looking at algebraic effects.
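On the consistent hashing part, here is a rough sketch of the usual hash-ring approach, my own illustration rather than Nielsen's code: shards are hashed onto a ring, a row key is hashed to a point, and the row lands on the first shard clockwise from that point.

```go
// Rough consistent-hashing sketch (illustrative): shards sit at
// hashed positions on a ring, and a row key maps to the first
// shard clockwise from its own hash. Adding or removing a shard
// only moves the keys in one arc of the ring.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	points []uint32          // sorted positions on the ring
	shards map[uint32]string // position -> shard name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(shardNames []string) *Ring {
	r := &Ring{shards: make(map[uint32]string)}
	for _, name := range shardNames {
		p := hashKey(name)
		r.points = append(r.points, p)
		r.shards[p] = name
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ShardFor returns the shard responsible for a row key.
func (r *Ring) ShardFor(rowKey string) string {
	h := hashKey(rowKey)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.shards[r.points[i]]
}

func main() {
	ring := NewRing([]string{"shard-0", "shard-1", "shard-2"})
	fmt.Println(ring.ShardFor("user:42"))
}
```

Real implementations usually place many virtual nodes per shard on the ring to even out load; this sketch omits that for brevity.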
From the article:<p>> Another example is figuring out the right tradeoffs between using local SSD disks and block-storage services (AWS EBS and others).<p>Local (instance store) disks on AWS are not appropriate for long-term storage, because the data is lost when the instance is stopped or terminated. AWS also doesn't offer huge amounts of local storage.
I'm kind of confused by companies like the one in the post. What is the selling point of these hosted DB companies running in AWS, when AWS and the rest of the providers themselves provide pretty good, probably much better, DB services? Is there that much money to be made running DBs on EC2 compared to the existing offerings they have?<p>Amazon, Google, MS, these companies print money and have built up massive engineering cultures to run reliable storage. I just don't see the value in trusting data with some VC-funded group over proven engineering work.<p>I worked on one of these in-house storage systems, and all we did was look at how the cloud providers did things already for inspiration. Might as well just use those. IDK, maybe someone can convince me of the value?
> For instance, blob storages such as S3 have enabled cloud database providers to offer flexible, unlimited storage (SingleStoreDB even coined the term “bottomless storage” for this).<p>Can someone please elaborate on that? What does it mean in the context of S3 and a DB? I know how traditional DBs work (PostgreSQL and MySQL). I know how S3 works (open-source implementations like MinIO). But S3 is not a random-access file on block storage, which is a prerequisite for PostgreSQL and MySQL. How is that solved for S3-based DBs? Can someone point to the docs, or even better an open-source implementation?
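One common answer, as far as I understand these designs, is to never treat S3 as a random-access file: the engine writes immutable segment files (LSM-style SSTables, or page/WAL chunks) to S3, keeps an index of where each block lives, serves reads with HTTP ranged GETs plus heavy local caching, and buffers writes locally before flushing whole files. Open-source projects worth reading in this space include Neon (Postgres with its storage layer on object storage) and SlateDB (an LSM built directly on object storage). Below is a hedged sketch of just the ranged-read part using the AWS Go SDK v2; the bucket, key and offsets are placeholders.

```go
// Sketch of how an S3-backed storage engine can do "random access":
// data files are immutable segments, an index gives the byte offset
// of the block you need, and only that block is fetched via an HTTP
// ranged GET. Bucket, key and offsets below are placeholders.
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// readBlock fetches bytes [off, off+n) of an immutable segment file.
func readBlock(ctx context.Context, client *s3.Client, bucket, key string, off, n int64) ([]byte, error) {
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		// Ranged GET: only the requested block crosses the network.
		Range: aws.String(fmt.Sprintf("bytes=%d-%d", off, off+n-1)),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()
	return io.ReadAll(out.Body)
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// e.g. read one 4 KiB block out of a large immutable segment file
	block, err := readBlock(ctx, client, "my-db-bucket", "segments/000042.sst", 8192, 4096)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("read", len(block), "bytes")
}
```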
Very similar to the problems that engineers of purpose-built storage systems faced one to two decades ago. There's a mix of those engineers and newer folks from academia or cloud in general leading the solutions for the cloud.