I've been working hard to upskill on the consistency and distributed systems sides of things. General recommendations:<p>- Designing Data Intensive Applications. Great overview of... basically everything, and every chapter has dozens of references. Can't recommend it enough.<p>- Read papers. I've had lots of a-ha moments going to wikipedia and looking up the oldest paper on a topic (wtf was in the water in Massachusetts in the 70s..). Yes they're challenging, no they're not impossible if you have a compsci undergrad equivalent level of knowledge.<p>- Try and build toy systems. I built out some small and trivial implementations of CRDTs here <a href="https://lewiscampbell.tech/sync.html" rel="nofollow noreferrer">https://lewiscampbell.tech/sync.html</a>, mainly by reading the papers (a minimal sketch of the idea follows below). They're subtle but they're not rocket science - mere mortals can do this if they apply themselves!<p>- Follow cool people in the field. Tigerbeetle stands out to me despite sitting at the opposite corner of the consistency/availability space from where I've made my nest. They really are poring over applied dist sys papers and implementing them. I joke that Joran is a dangerous man to listen to because his talks can send you down rabbit-holes and you begin to think maybe he isn't insane for writing his own storage layer..<p>- Did I mention read papers? Seriously, the research of the smartest people on planet earth is on the internet, available for your consumption, for free. Take a moment to reflect on how incredible that is. Anyone anywhere on planet earth can git gud if they apply themselves.
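To make the toy-CRDT point concrete, here is a minimal sketch of a state-based grow-only counter (G-Counter), one of the simplest CRDTs. This is my own illustration of the textbook construction, not code from the page linked above; the type and function names are made up.

```go
// Minimal state-based G-Counter CRDT sketch (illustrative only).
// Each replica only increments its own slot; Merge takes the
// element-wise max, so it is commutative, associative and
// idempotent, and replicas converge regardless of delivery order.
package main

import "fmt"

type GCounter struct {
	counts map[string]uint64 // replica ID -> that replica's local count
}

func NewGCounter() *GCounter {
	return &GCounter{counts: make(map[string]uint64)}
}

// Inc bumps this replica's own slot.
func (g *GCounter) Inc(replica string) {
	g.counts[replica]++
}

// Value sums every replica's slot to get the global count.
func (g *GCounter) Value() uint64 {
	var total uint64
	for _, c := range g.counts {
		total += c
	}
	return total
}

// Merge folds another replica's state into this one (element-wise max).
func (g *GCounter) Merge(other *GCounter) {
	for id, c := range other.counts {
		if c > g.counts[id] {
			g.counts[id] = c
		}
	}
}

func main() {
	a, b := NewGCounter(), NewGCounter()
	a.Inc("a")
	a.Inc("a")
	b.Inc("b")
	a.Merge(b)
	b.Merge(a)
	fmt.Println(a.Value(), b.Value()) // 3 3: both replicas converge
}
```

Because Merge is idempotent, replicas can exchange state in any order, any number of times (including retries and duplicates), and still converge.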
When I read that Google installed their own atomic clocks in each datacenter for Spanner, I knew they were doing some real computer science (and probably general relativity?) work: <a href="https://www.theverge.com/2012/11/26/3692392/google-spanner-atomic-clocks-GPS" rel="nofollow noreferrer">https://www.theverge.com/2012/11/26/3692392/google-spanner-a...</a>
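For context, the reason for the clocks is Spanner's TrueTime API: every timestamp comes with an uncertainty bound, and a transaction waits out that bound before its commit becomes visible, so timestamp order matches real-time order across datacenters. A toy sketch of that commit-wait idea follows; the epsilon constant and function names are my own invention, not Google's API.

```go
// Toy sketch of Spanner-style "commit wait" (not Google's API;
// the uncertainty bound here is a made-up constant).
package main

import (
	"fmt"
	"time"
)

// epsilon is the assumed worst-case clock error across machines.
// Spanner keeps this small (single-digit milliseconds) by putting
// GPS receivers and atomic clocks in each datacenter.
const epsilon = 5 * time.Millisecond

// commit assigns a timestamp, then waits until every node's clock
// is guaranteed to have passed it, so no later transaction can be
// assigned an earlier timestamp.
func commit() time.Time {
	ts := time.Now()
	time.Sleep(epsilon) // "commit wait": sit out the uncertainty window
	return ts
}

func main() {
	t := commit()
	fmt.Println("commit visible at timestamp", t)
}
```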
I’ve built pretty much my entire career around this problem and it still feels evergreen. If you want a meaningful and rich career as a software engineer, distributed storage is a great area to focus on.
I was hoping that the blog post would actually spell out examples of problems. Is it just me or have there been a lot of shorter blog posts on HN lately that are really no more than an introduction section rather than an actual full article?
If you are interested in the performance aspects of databases, I would recommend watching this great talk [0] from Alexey, a ClickHouse developer, where he covers topics like designing systems by first understanding the hardware's capabilities and the problem landscape.<p>[0]: <a href="https://www.youtube.com/watch?v=ZOZQCQEtrz8">https://www.youtube.com/watch?v=ZOZQCQEtrz8</a>
I worked at a specialty database software vendor for almost 4 years, albeit on ML connectors. I recall that some of the hardest challenges were figuring out each cloud vendor's poorly documented and rapidly changing/breaking marketplace launch mechanisms, usually built atop their own Kubernetes flavor (EKS, AKS, GKE, etc.).
There are so many interesting problems to solve. I just wish there were libraries or solutions that solved a lot of them at the least cost, so that I could build on some good foundations.<p>RocksDB is an example of that.<p>I am playing around with SIMD, multithreaded queues and barriers (not on the same problem).<p>I haven't read the DDIA book.<p>I used Michael Nielsen's consistent hashing code for distributing SQL database rows between shards (a rough hash-ring sketch follows below).<p>I have an eventually consistent protocol that is not linearizable.<p>I am currently investigating how to efficiently schedule system events, such as a TCP socket becoming ready for reading (EPOLLIN) or writing (EPOLLOUT), rather than data events.<p>I want super flexible scheduling styles of control flow. I'm looking at barriers right now.<p>I am thinking about how to respond to events with low latency and across threads.<p>I'm playing with some coroutines in assembly by Marce Coll and looking at algebraic effects.
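On the consistent hashing part, here is a rough sketch of the usual hash-ring approach, my own illustration rather than Nielsen's code: shards are hashed onto a ring, a row key is hashed to a point, and the row lands on the first shard clockwise from that point.

```go
// Rough consistent-hashing sketch (illustrative): shards sit at
// hashed positions on a ring, and a row key maps to the first
// shard clockwise from its own hash. Adding or removing a shard
// only moves the keys in one arc of the ring.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	points []uint32          // sorted positions on the ring
	shards map[uint32]string // position -> shard name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(shardNames []string) *Ring {
	r := &Ring{shards: make(map[uint32]string)}
	for _, name := range shardNames {
		p := hashKey(name)
		r.points = append(r.points, p)
		r.shards[p] = name
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ShardFor returns the shard responsible for a row key.
func (r *Ring) ShardFor(rowKey string) string {
	h := hashKey(rowKey)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.shards[r.points[i]]
}

func main() {
	ring := NewRing([]string{"shard-0", "shard-1", "shard-2"})
	fmt.Println(ring.ShardFor("user:42"))
}
```

Real implementations usually place many virtual nodes per shard on the ring to even out load; this sketch omits that for brevity.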
From the article:<p>> Another example is figuring out the right tradeoffs between using local SSD disks and block-storage services (AWS EBS and others).<p>Local (instance store) disks on AWS are not appropriate for long-term storage, because the data is lost when the instance is stopped or terminated. AWS also doesn't offer huge amounts of local storage.
I'm kind of confused by companies like the one in the post. What is the selling point of these hosted DB companies running in AWS, when AWS and the rest of the providers themselves provide pretty good, probably much better, DB services? Is there that much money to be made running DBs on EC2 compared to the existing offerings they have?<p>Amazon, Google, MS, these companies print money and have built up massive engineering cultures to run reliable storage. I just don't see the value in trusting data with some VC-funded group over proven engineering work.<p>I worked on one of these in-house storage systems, and all we did was look at how the cloud providers did things already for inspiration. Might as well just use those. IDK, maybe someone can convince me of the value?
> For instance, blob storages such as S3 have enabled cloud database providers to offer flexible, unlimited storage (SingleStoreDB even coined the term “bottomless storage” for this).<p>Can someone please elaborate on that? What does it mean in the context of S3 and a DB? I know how traditional DBs work (PostgreSQL and MySQL). I know how S3 works (open-source implementations like MinIO). But S3 is not a random-access file on block storage, which is a prerequisite for PostgreSQL and MySQL. How is that solved for S3-based DBs? Can someone point to the docs, or even better an open-source implementation?
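One common answer, as far as I understand these designs, is to never treat S3 as a random-access file: the engine writes immutable segment files (LSM-style SSTables, or page/WAL chunks) to S3, keeps an index of where each block lives, serves reads with HTTP ranged GETs plus heavy local caching, and buffers writes locally before flushing whole files. Open-source projects worth reading in this space include Neon (Postgres with its storage layer on object storage) and SlateDB (an LSM built directly on object storage). Below is a hedged sketch of just the ranged-read part using the AWS Go SDK v2; the bucket, key and offsets are placeholders.

```go
// Sketch of how an S3-backed storage engine can do "random access":
// data files are immutable segments, an index gives the byte offset
// of the block you need, and only that block is fetched via an HTTP
// ranged GET. Bucket, key and offsets below are placeholders.
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// readBlock fetches bytes [off, off+n) of an immutable segment file.
func readBlock(ctx context.Context, client *s3.Client, bucket, key string, off, n int64) ([]byte, error) {
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		// Ranged GET: only the requested block crosses the network.
		Range: aws.String(fmt.Sprintf("bytes=%d-%d", off, off+n-1)),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()
	return io.ReadAll(out.Body)
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// e.g. read one 4 KiB block out of a large immutable segment file
	block, err := readBlock(ctx, client, "my-db-bucket", "segments/000042.sst", 8192, 4096)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("read", len(block), "bytes")
}
```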
Very similar to the problems that engineers of purpose-built storage systems faced one to two decades ago. There's a mix of those engineers and newer folks from academia or cloud in general leading the solutions for the cloud.