
Saving cloud costs by writing our own database

211 points, by wolframhempel, about 1 year ago

40 comments

RHSeeger, about 1 year ago

> we've replaced our $10k/month Aurora instances with a $200/month Elastic Block Storage (EBS) volume.

Without any intent to insult what you've done (because the information is interesting and the writeup is well done)... how do the numbers work out when you account for actually implementing and maintaining the database?

- Developer(s) time to initially implement it

- PjM/PM time to organize the initial build

- Developer(s) time for maintenance (bug fixes and enhancement requests)

- PjM/PM time to organize maintenance

The cost of someone to maintain the actual "service" (independent of developing it) is, I assume, either similar or lower, so there's probably a win there. I'm assuming you had someone on board who was in charge of making sure Aurora was configured and used correctly; it would be just as easy, if not easier, to do the same for your custom database.

The $120,000/year cost of Aurora seems like it would be less than the cost of development/organization time for the custom database.

Note: It's clear you have other reasons for needing your custom database. I get that. I was just curious about the costs.
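A rough break-even calculation makes the comparison concrete (a sketch; the Aurora/EBS prices are from the article, but the engineer cost figure is an assumption, not something from the thread):

```python
# Back-of-envelope: does the Aurora saving cover the engineering effort?
aurora_monthly = 10_000   # from the article
ebs_monthly = 200         # from the article
saving_per_year = (aurora_monthly - ebs_monthly) * 12   # $117,600

engineer_per_year = 200_000  # assumed fully-loaded cost of one developer
break_even_months = saving_per_year / (engineer_per_year / 12)
print(f"saves ${saving_per_year:,}/yr; pays off only if build + upkeep "
      f"stays under ~{break_even_months:.0f} engineer-months per year")
```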
jrockway, about 1 year ago

Everyone seems fixated on the word "database" and the engineering cost of writing one. This is a log file. You write data to the end of it. You flush it to disk whenever you've filled up some unit of storage that is efficient to write to disk. Every query is a full table scan. If you have multiple writers, this works out very nicely when you have one API server per disk; each server writes its own files (with a simple mutex gating the write of a batch of records), and queries involve opening all the files in parallel and aggregating the result. (Map, shuffle, reduce.)

Atomic: not applicable, as there are no transactions. Consistent: no, as there is no protection against losing the tail end of writes (consider "no space left on device" halfway through a record). Isolated: not applicable, as there are no transactions. Durable: no, the data is buffered in memory before being written to the network (EBS is the network, not a disk).

So with all of this in mind, the engineering cost is not going to be higher than $10,000 a month. It's a print statement.

If it sounds like I'm being negative, I'm not. Log files are one of my favorite types of time-series data storage. A for loop that reads every record is one of my favorite query plans. But this is not what things like Postgres or Aurora aim to do; they aim for things like "we need to edit past data several times per second and derive some of those edits from data that is also being edited". That brings real complexity, and a big old binary log file with some for loops isn't going to get you there. But if you don't need those things, then you don't need those things, and you don't need to pay for them.

The question you always have to ask, though, is: have you reasoned about the business impact of losing data through unhandled transactional conflicts? "Read committed" or "non-durable writes" are often big customer-service problems. "You deducted this bill payment twice, and now I can't pay the rent!" Does it matter to your end users? If not, you can save a lot of time and money. If it does, well, then a best-effort log file probably isn't going to be good for business.
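A minimal sketch of that pattern, assuming an illustrative fixed-size record layout (the article doesn't publish its actual format): one writer batches records under a mutex and flushes them as a single sequential write, and every query is a full scan.

```python
import struct, threading

RECORD = struct.Struct("<dIff")  # ts, device_id, lat, lon - assumed layout

class LogWriter:
    """Append-only log: buffer records, flush a batch as one sequential write."""
    def __init__(self, path, batch_size=30_000):
        self.f = open(path, "ab", buffering=0)
        self.buf, self.batch_size = [], batch_size
        self.lock = threading.Lock()  # the "simple mutex gating the write"

    def append(self, ts, device_id, lat, lon):
        with self.lock:
            self.buf.append(RECORD.pack(ts, device_id, lat, lon))
            if len(self.buf) >= self.batch_size:
                self.f.write(b"".join(self.buf))  # one large write, few IOPS
                self.buf.clear()

def scan(path, device_id):
    """Every query is a full table scan over fixed-size records."""
    with open(path, "rb") as f:
        while len(chunk := f.read(RECORD.size)) == RECORD.size:
            ts, dev, lat, lon = RECORD.unpack(chunk)
            if dev == device_id:
                yield ts, lat, lon
```

With one such file per API server, a query opens all files in parallel and merges the per-file scans - the map/shuffle/reduce step the comment mentions.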
mdaniel, about 1 year ago

Anytime I hear "we need to blast in per-second measurements of ...", my mind jumps to "well, have you looked at the bazillions of time-series databases out there?" The fact that those payloads happen to be (time, lat, long, device_id) tuples seems immaterial to the time-series database, and they can then be rolled up into whatever level of aggregation one wishes for long-term storage.

It also seems that just about every open-source "Datadog / New Relic replacement" is built on top of ClickHouse, and even they themselves allege multi-petabyte capabilities <https://news.ycombinator.com/item?id=39905443>

OT1H, I saw the "we did research" part of the post, and I for sure have no horse in your race of NIH, but "we write to EBS, what's the worst that can happen" strikes me as... be sure you're comfortable with the tradeoffs you've made in order to get a catchy blog post title.
yau8edq12i, about 1 year ago

Wasn't this already discussed here yesterday? The main criticism of the article is that they didn't write a database; they wrote an append-only log system with limited query capabilities. Which is fine. But it's not a "database" in the sense that someone would understand when reading the title.
zX41ZdbW, about 1 year ago

Sounds totally redundant to me. You can write all location updates into ClickHouse, and the problem is solved.

As a demo, I've recently implemented a tool to browse 50 billion airplane locations: https://adsb.exposed/

Disclaimer: I'm the author of ClickHouse.
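For readers unfamiliar with it, a minimal sketch of what that could look like from Python using the clickhouse-driver client (the schema, table name, and ordering key are illustrative assumptions, not from the comment):

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("localhost")

# Illustrative schema: a MergeTree ordered for per-device time-range scans.
client.execute("""
    CREATE TABLE IF NOT EXISTS locations (
        ts        DateTime64(3),
        device_id UInt32,
        lat       Float64,
        lon       Float64
    ) ENGINE = MergeTree
    ORDER BY (device_id, ts)
""")

# ClickHouse favors large batched inserts; the article's "one write per
# second per node" cadence maps naturally onto this.
def write_batch(rows):  # rows: list of (ts, device_id, lat, lon) tuples
    client.execute(
        "INSERT INTO locations (ts, device_id, lat, lon) VALUES", rows)
```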
MuffinFlavored, about 1 year ago

> We want to be able to handle up to 30k location updates per second per node. They can be buffered before writing, leading to a much lower number of IOPS.

> This storage engine is part of our server binary, so the cost for running it hasn't changed. What has changed though, is that we've replaced our $10k/month Aurora instances with a $200/month Elastic Block Storage (EBS) volume. We are using Provisioned IOPS SSD (io2) with 3000 IOPS and are batching updates to one write per second per node and realm.

I would be curious to hear what that "1 write per second" looks like in terms of throughput/size?
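A rough answer follows from the article's own numbers, if you assume a record size (the 32 bytes below is an assumption, not from the post):

```python
# Back-of-envelope size of one batched write, under an assumed record size.
updates_per_sec = 30_000   # stated per-node ceiling
bytes_per_record = 32      # assumption: ts + id + lat/lon + some overhead
batch_bytes = updates_per_sec * bytes_per_record
print(f"~{batch_bytes / 1e6:.1f} MB per one-second batch per node")  # ~1.0 MB
# One ~1 MB sequential write per second is trivial for an io2 volume;
# the 3000 provisioned IOPS would then be mostly headroom for reads.
```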
time0ut, about 1 year ago

Good article.

> EBS has automated backups and recovery built in and high uptime guarantees, so we don't feel that we've missed out on any of the reliability guarantees that Aurora offered.

It may not matter for their use case, but I don't believe this is accurate in a general sense. EBS volumes are local to an availability zone, while Aurora's storage is replicated across a quorum of AZs [0]. If a region loses an AZ, the database instance can be failed over to a healthy one with little downtime. This has only happened to me a couple of times over the past three years, but it was pretty seamless and things were back on track pretty fast.

I didn't see anything in the article about addressing availability if there is an AZ outage. It may simply not matter, or maybe they have solved for it. Could be a good topic for a follow-up article.

[0] https://aws.amazon.com/blogs/database/introducing-the-aurora-storage-engine/
kumarm, about 1 year ago

I built a similar system in 2002 using JGroups (JavaGroups at the time, before the open source project was acquired by JBoss) while persisting asynchronously to a DB (Oracle at the time). Our scale even in 2002 was much higher than 13,000 vehicles.

The project, I believe, still appears as a success story on the JGroups website after 20+ years. I am surprised people are writing their own databases for location storage in 2024 :). There was no need to invent new technology in 2002, and definitely not in 2024.
afro88, about 1 year ago

These two sentences don't work together:

> [We need to cater for] Delivery companies that want to be able to replay the exact seconds leading up to an accident.

> We are ok with losing some data. We buffer about 1 second worth of updates before we write to disk

Impressive engineering effort on its own, though!
xyst, about 1 year ago

This seems to me like they rewrote Kafka.

Even moderately sized Kafka clusters can handle the throughput requirement. You can even optimize for performance over durability.

Some limited query capability is available with components such as ksqlDB.

Maybe offload historical data to blob storage.

Then again, Kafka is kind of complicated to run at these scales. Very easy to fuck up.
the_duke, about 1 year ago

I don't know what geospatial features are needed, but otherwise time-series databases are great for this use case.

I especially like ClickHouse. It's generic but also a powerhouse that handles most things you throw at it, handles huge write volumes (with sufficient batching), supports horizontal scaling, and can offload long-term storage to S3 for much smaller disk requirements. The geo features in ClickHouse are pretty basic, but it does have some built-in geo datatypes and functions, e.g. for calculating distance.
kaladin_1, about 1 year ago

I love the attitude: we didn't see a good fit, so we rolled our own.

Sure, it won't cover the bazillion cases the DBs out there do, but that's not what you need. The source code is small enough for any team member to jump in and debug while pushing performance in any direction you want.

Kudos!
CapeTheory, about 1 year ago

It's amazing what can happen when software companies start doing something approximating real engineering, rather than just sticking a UI on top of some managed services.
yunohn, about 1 year ago

This is more a bespoke file format than a full-blown database. It's optimized for one table schema and a few specific queries.

That's not a negative, though; not everything needs a general-purpose database. Clearly this satisfies their requirements, which is the most important thing.
diziet, about 1 year ago

As others have mentioned, hosting your own ClickHouse instance could probably yield major savings while allowing much more flexibility for querying the data in the future. If your use case can be served by what ClickHouse offers, gosh, is it an incredibly fast and reliable open source solution that you can host yourself.
bawolff, about 1 year ago

Kind of misleading not to include the cost of developing it yourself.

I think everything is cheaper than cloud if you do it yourself and don't count staffing costs.
Simon_ORourke, about 1 year ago

I've no doubt this is true. However, anyone I've ever met who exclaimed "let's create our own database" would be viewed as dangerous, unprofessional, or downright uneducated in any business meeting. There's just too much that can go badly wrong, for all the sunk cost in getting anything up and running.
rstuart4133, about 1 year ago

A lot of people here are making very confident-sounding assertions, yet some are saying it's just an append-only log file and some imply it's sharded. Something everyone does agree on is that the article is very vague about what geospatial features they need.

The one thing they do say is "no ACID". That implies no B-trees, because an unexpected stop means a corrupted B-tree. Perhaps they use a hash instead, but it would have to be a damned clever hash tree implementation to avoid the same problem. Or perhaps they just rebuild the index after a crash.

Even an append-only log file has to be handled carefully without ACID. An uncontrolled shutdown in most file systems will at least leave blocks of nulls in the file, and half-written records if they cross disk-block boundaries.

It's a tantalising headline, but after reading the 1,200 words I'm none the wiser about what they built or whether it meets their own specs. A bit of a disappointment.
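One standard way to handle that torn-tail problem in an append-only log is length-prefix plus checksum framing. A minimal sketch of the technique (a common approach, not necessarily what they built):

```python
import struct, zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload

def frame(payload: bytes) -> bytes:
    """Wrap a record so a half-written tail can be detected on replay."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def replay(path):
    """Yield intact records; nulls or a torn tail simply end the scan."""
    with open(path, "rb") as f:
        while len(hdr := f.read(HEADER.size)) == HEADER.size:
            length, crc = HEADER.unpack(hdr)
            payload = f.read(length)
            if length == 0 or len(payload) < length or zlib.crc32(payload) != crc:
                break  # torn or corrupt write: discard everything after it
            yield payload
```

Recovery after a crash is then just replaying to the first bad frame and truncating there; no B-tree needs rebuilding because there is no index.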
INTPenis, about 1 year ago

That is such an insane headline.

You might as well say "we saved 100% of cloud costs by writing our own cloud".
endisneigh, about 1 year ago

It would be interesting to see a database built from the ground up to be trivial to maintain.

I use managed databases, but is there really that much to do to maintain a database yourself? The host requires some level of maintenance - changing disks, updating the host operating system, failover during downtime for machine repair, etc. If you use a database built for failover, I imagine much of this doesn't actually affect operations that much, assuming you slightly over-provision.

For the database alone, I think the maintenance work is greatly exaggerated. That said, I still think it's more than using a managed database, which is why my company still uses one.

In this case, though, an append log seems pretty simple, imo. Better to self-host.
fifilura, about 1 year ago

Would building a data lakehouse be an option?

Stream the events to S3, stored as Parquet or Avro files, maybe in Iceberg format.

Then use Trino/Athena to do the long-term heavy lifting, or for on-demand use cases.

Then only push what you actually need live to Aurora.
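A minimal sketch of that pipeline's write side with pyarrow (the bucket name, partitioning scheme, and field names are made up for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # assumed region

def flush_batch(rows):
    """rows: list of dicts with ts (ISO-8601 string), device_id, lat, lon."""
    table = pa.Table.from_pylist(rows)
    # Partition by day so Athena/Trino can prune scans on time predicates.
    day = rows[0]["ts"][:10]
    pq.write_table(table,
                   f"my-bucket/locations/day={day}/batch.parquet",
                   filesystem=s3)
```

The live/recent slice stays in a small hot store, while the columnar files on S3 carry the long retention cheaply.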
kroolik, about 1 year ago

I could be missing something, but I can't really wrap my head around "unlimited parallelism".

What they say is that the logic is embedded in their server binary and they write to a local EBS volume. But what happens when they have two servers? EBS can't be read-write mounted in multiple places.

Won't adding a second (and further) servers cause trouble, like migrating data when a new server joins the cluster or a server leaves it?

I understand Aurora was too expensive for them. But I think it is important to note that their whole setup is not HA at all (which may be fine, but the header could be misleading).
rvba, about 1 year ago

> So - given that we don't know upfront what level of granularity each customer will need, we store every single location update.

Maybe I'm cynical, but it's interesting that "the business" didn't start checking this to cut costs. I know that customers love this feature. Cynically, I can see it costing more, so some customers would drop it.

Also, it looks like they rewrote a log / time-series "database" / key-value store? As others mention, it sounds like reinventing the wheel to get a cool blog post and boost careers by solving "problems".
rad_gruchalski, about 1 year ago

> we've replaced our $10k/month Aurora instances with a $200/month Elastic Block Storage (EBS) volume

Reminds me how I implemented MSSQL active-active log replication over Dropbox shares back in 2010 to synchronise two databases, one in the US and one in the UK. It worked perfectly fine, except for that one hurricane that took them out for longer than 14 days - more than the preconfigured log retention period.
pheatherlite, about 1 year ago

How fast can reads be, though? Even if skipping along at a fixed offset, reading 4-byte identifiers to filter out location updates for vehicles, that's still a sequential scan of a massive file. Wouldn't this read path become a choke point, to a degree that makes growth a curse? Then you get into weird architectures that exist solely to facilitate predigested reads.
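For a sense of scale, here's a sketch of such a scan with an assumed 20-byte record layout (the article doesn't publish one):

```python
import numpy as np

# Assumed packed layout: timestamp, 4-byte device id, lat, lon (20 B/record).
REC = np.dtype([("ts", "<f8"), ("id", "<u4"), ("lat", "<f4"), ("lon", "<f4")])

def scan_for(path, device_id):
    data = np.memmap(path, dtype=REC, mode="r")  # no full read into RAM
    return data[data["id"] == device_id]         # vectorized filter

# A day at 30k updates/sec is ~2.6B records (~52 GB at 20 B each), so a
# full scan is bounded by disk bandwidth. That is why sharding by file/node
# and scanning files in parallel (as others note) matters as data grows.
```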
remram, about 1 year ago

They mention all those features of databases, presenting them as important:

> Databases are a nightmare to write, from Atomicity, Consistency, Isolation, and Durability (ACID) requirements to sharding to fault recovery to administration - everything is hard beyond belief.

Then they talk about their geospatial requirements, PostGIS etc., making it seem like they need geospatial features ("PostGIS for geospatial data storage" - wtf? You need PostGIS for geospatial *query*, not merely *storage*...)

In reality, they did not require any of the features they mention throughout the article. What a weird write-up!

I guess the conclusion is "read the f*-ing specs". Don't grab a geospatial DBMS just because you heard the words "longitude" and "database" once.
nikonyrh, about 1 year ago

Very interesting. It must feel great to get to apply CS knowledge at work, rather than writing basic CRUD APIs / websites.
trebecks, about 1 year ago

If I'm reading the OP right, they kind of use EBS as a buffer for fresh data until it ages out to S3. They use a "local" disk to hold the stuff touched by the queries people actually make, and those queries run quick. They let the old stuff rot in S3, where it's almost never used. That sounds like a good way to save money, plus the stuff that's done often is fast.

The EBS SLAs look reasonable to a non-expert like me, and you can take snapshots. It sounds like you need to be careful when snapshotting to avoid inconsistencies if data is only partially flushed to disk, so you'd need to pause I/O during the snapshot if those inconsistencies matter. That sounds bad and would encourage you to take less frequent snapshots...? You also pay for the snapshot storage, but I guess you wouldn't need to keep many. I like that AWS defines "SnapshotAPIUnits" to describe how you get charged for the API calls.

With Aurora, it looks like you can synchronously replicate to a secondary (or multiple secondaries) across AZs in a single region. It sounds nice to have a synchronous copy of the data people are using. The OP says they're OK with a few seconds of data loss, so I'm wondering how painful losing a volume right before taking a snapshot would be.

I wonder if anything off the shelf does something similar. It sounds like people are suggesting ClickHouse. I saw the Buffer table in their docs and it sounds similar: https://clickhouse.com/docs/en/engines/table-engines/special/buffer. It looks like it has support for using S3 as cold storage too, and I even see geo types and functions in the docs. I've never used ClickHouse, so I don't know if I'm understanding what I read, but it sounds like you could do something similar to what's described in the post with ClickHouse, if the existing geo types + functions work and you're too lazy to roll something yourself.
loftsy, about 1 year ago

Apache Cassandra could be a good fit here: highly parallel, frequent writes, with some consistency loss allowed.
exabrial, about 1 year ago

Why is everyone dead set on "must use AWS" these days? One can cut their cloud costs by 100x with colo.

And if you write your own DB, as they did here, it can 100% take advantage of your setup.
zinodaur, about 1 year ago

Very cool! When I started reading the article I thought it was going to end up using an LSM tree / RocksDB, but y'all went even more custom than that.
mavili, about 1 year ago

That's called engineering: you had a problem, and you came up with a solution THAT WORKS for your needs. Nicely done, and thanks for sharing.
selimnairb, about 1 year ago
Seems like DuckDB or TileDB backed by S3 may meet your needs and be a lot cheaper than Aurora.
awinter-py, about 1 year ago
we have invented write concern = 0
halayli, about 1 year ago

They talk about what they store, but make zero mention of their retrieval requirements.
tshanmu, about 1 year ago

"Of course, that's an unfair comparison, after all, Postgres is a general purpose database with an expressive query language and what we've built is just a cursor streaming a binary file feed with a very limited set of functionality - but then again, it's the exact functionality we need and we didn't lose any features."
icsa, about 1 year ago

How is it possible to save more than 100%?
bevekspldnw, about 1 year ago

"We are running a cloud platform that tracks tens of thousands of people and vehicles simultaneously"

... that's not something to brag about.
brianhama, about 1 year ago

Honestly, these requirements don't seem that high. There are tens of thousands of companies doing more spatial data processing than this that use standard cloud databases just fine.
SmellTheGlove, about 1 year ago

I'm surprised to see the (mostly) critical posts. My reaction before coming to the comments was:

- This is core to their platform; it makes sense to fit it closely to their use cases

- They didn't need most of what a full database offers - they're "just" logging

- They know the tradeoffs and designed appropriately to accept them to keep costs down

I'm a big believer in building on top of the solved problems of the world, but it's also completely okay to build shit. That used to be what this industry did. Now it seems to have shifted to where 5-10% of large players invent shit and open source it, and the other 90-95% just stitch together things they didn't build, on infrastructure they don't own or operate, to produce the latest CRUD app. And hell, that's not bad either; it's pretty much my job. But it's also occasionally nice to see someone build to their spec and save a few dollars. It's a good reminder that costs matter, particularly when money isn't free and incinerating endless piles of it chasing a (successful) public exit is no longer the norm.

I get the argument that developer time isn't free, but neither is running AWS managed services, despite the name. And they didn't really build a general-purpose database; they built a much simpler logger for their use case to replace a database. I'd be surprised if they hired someone additional to build this, and if they did, I'd guess (knowing absolutely nothing) that the added dev spends 80% of their time doing other things. It's not like they launched a datacenter. They just built the software and run it on cheaper AWS services, versus paying AWS extra for the more complex product.