
Building and operating a pretty big storage system called S3

804 points by werner almost 2 years ago

23 comments

Twirrim almost 2 years ago
> That's a bit error rate of 1 in 10^15 requests. In the real world, we see that blade of grass get missed pretty frequently – and it's actually something we need to account for in S3.

One of the things I remember from my time at AWS was conversations about how 1 in a billion events end up being a daily occurrence when you're operating at S3 scale. Things that you'd normally mark off as so wildly improbable it's not worth worrying about have to be considered, and handled.

Glad to read about ShardStore, and especially the formal verification, property-based testing etc. The previous generation of services were notoriously buggy, a very good example of the usual perils of organic growth (but at least really well designed such that they'd fail "safe", ensuring no data loss, something S3 engineers obsessed about).
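A quick back-of-the-envelope check of that point, using the "over 100 million requests per second" figure quoted elsewhere in this thread (the numbers are illustrative, not an official S3 metric):

```python
# How often does a "one in a billion" event occur at ~100M requests/second?
requests_per_second = 100_000_000   # figure quoted elsewhere in the thread
event_probability = 1e-9            # "one in a billion"

events_per_day = requests_per_second * 86_400 * event_probability
print(f"~{events_per_day:,.0f} occurrences per day")   # ~8,640 per day
```

So at that request rate a one-in-a-billion event isn't merely a daily occurrence; it happens thousands of times a day.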
epistasis almost 2 years ago
Working in genomics, I've dealt with lots of petabyte data stores over the past decade. Having used AWS S3, GCP GCS, and a raft of storage systems for collocated hardware (Ceph, Gluster, and an HP system whose name I have blocked from my memory), I have no small amount of appreciation for the effort that goes into operating these sorts of systems.

And the benefit of sharing disk IOPS with untold numbers of other customers is hard to overstate. I hadn't heard the term "heat" as it's used in the article, but it's incredibly hard to mitigate on a single system. For our co-located hardware clusters, we would have to customize the batch systems to treat IO as an allocatable resource, the same as RAM or CPU, in order to manage it correctly across large jobs. S3 and GCP are super expensive, but the performance can be worth it.

This sort of article is some of the best of HN, IMHO.
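A minimal sketch of that scheduling idea – treating IOPS as an allocatable resource alongside CPU and RAM. The types, capacities, and placement rule here are invented for illustration, not the commenter's actual batch system:

```python
from dataclasses import dataclass

@dataclass
class Node:
    cpu: int      # free cores
    ram_gb: int   # free RAM
    iops: int     # free IO operations/sec (the "heat" budget)

@dataclass
class Job:
    cpu: int
    ram_gb: int
    iops: int

def fits(node: Node, job: Job) -> bool:
    # IO is checked exactly like CPU and RAM, so an IO-heavy job
    # can't be packed onto a node that is already IO-saturated.
    return (node.cpu >= job.cpu and
            node.ram_gb >= job.ram_gb and
            node.iops >= job.iops)

def allocate(nodes: list[Node], job: Job) -> Node | None:
    candidates = [n for n in nodes if fits(n, job)]
    # Prefer the node with the most spare IOPS, to spread heat.
    return max(candidates, key=lambda n: n.iops, default=None)
```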
anderspitman almost 2 years ago
The things we could build if S3 specified a simple OAuth2-based protocol for delegating read/write access. The world needs an HTTP-based protocol for apps to access data on the user's behalf. Google Drive is the closest to this, but it only has a single provider and other issues[0]. I'm sad remoteStorage never caught on. I really hope Solid does well, but it feels too complex to me. My own take on the problem is https://gemdrive.io/, but it's mostly on hold while I'm focused on other parts of the self-hosting stack.

[0]: https://gdrivemusic.com/help
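For contrast, the closest built-in delegation mechanism S3 offers today is the time-limited presigned URL rather than an OAuth2 grant. A minimal boto3 sketch (the bucket and key names are made up):

```python
import boto3

# Generate a URL that lets the holder read one object for 15 minutes,
# without handing over any AWS credentials. Bucket/key are placeholders.
s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-user-data", "Key": "photos/2023/cat.jpg"},
    ExpiresIn=900,
)
print(url)
```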
deathanatos almost 2 years ago
> Now, let's go back to that first hard drive, the IBM RAMAC from 1956. Here are some specs on that thing:

> Storage Capacity: 3.75 MB

> Cost: ~$9,200/terabyte

Those specs can't possibly be correct. If you multiply the cost by the storage, the cost of the drive works out to 3¢.

This site[1] states,

> It stored about 2,000 bits of data per square inch and had a purchase price of about $10,000 per megabyte

So perhaps the specs should read $9,200 / megabyte? (Which would put the drive's cost at $34,500, which seems more plausible.)

[1]: https://www.historyofinformation.com/detail.php?entryid=952
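The arithmetic behind that objection, spelled out:

```python
# Checking the RAMAC numbers both ways.
capacity_mb = 3.75

# As printed: ~$9,200 per *terabyte*
as_printed = capacity_mb / 1_000_000 * 9_200
print(f"${as_printed:.2f}")      # ≈ $0.03 — the implausible 3-cent drive

# As the commenter suggests: ~$9,200 per *megabyte*
corrected = capacity_mb * 9_200
print(f"${corrected:,.0f}")      # $34,500 — far more plausible
```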
mannyv almost 2 years ago
What most people don't realize is that the magic isn't in handling the system itself; the magic is making authorization appear to be zero-cost.

In distributed systems, authorization is incredibly difficult. At the scale of AWS it might as well be magic. AWS has a rich permissions model, with changes to authorization bubbling through the infrastructure at sub-millisecond speed – while handling probably trillions of requests.

This and logging/accounting for billing are the two magic pieces of AWS that I'd love to see an article about.

Note that S3 does AA differently than other services, because the permissions are on the resource. I suspect that's for speed?
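A small illustration of what "permissions on the resource" means for S3: the policy document is attached to the bucket itself rather than to the caller's identity. A minimal boto3 sketch (the account ID, role, and bucket name are placeholders):

```python
import json
import boto3

# A resource-based policy: it lives on the bucket, not on the caller.
# All identifiers below are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/reporting"},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-bucket/reports/*",
    }],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="example-bucket", Policy=json.dumps(policy))
```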
Narciss almost 2 years ago
"As a really senior engineer in the company, of course I have strong opinions and I absolutely have a technical agenda. But if I interact with engineers by just trying to dispense ideas, it's really hard for any of us to be successful. It's a lot harder to get invested in an idea that you don't own. So, when I work with teams, I've kind of taken the strategy that my best ideas are the ones that other people have instead of me. I consciously spend a lot more time trying to develop problems, and to do a really good job of articulating them, rather than trying to pitch solutions. There are often multiple ways to solve a problem, and picking the right one is letting someone own the solution."

"I learned that to really be successful in my own role, I needed to focus on articulating the problems and not the solutions, and to find ways to support strong engineering teams in really owning those solutions."

I love this. It reminds me of the IKEA effect to an extent. Based on this, to get someone to be enthusiastic about what they do, you have to encourage ownership. And a great way is to have it be 'their idea'.
jl6 almost 2 years ago
Great to see Amazon employees being allowed to talk openly about how S3 works behind the scenes. I would love to hear more about how Glacier works. As far as I know, they have never revealed what the underlying storage medium is, leading to a lot of wild speculation (tape? offline HDDs? custom HDDs?).
dsalzman almost 2 years ago
> Imagine a hard drive head as a 747 flying over a grassy field at 75 miles per hour. The air gap between the bottom of the plane and the top of the grass is two sheets of paper. Now, if we measure bits on the disk as blades of grass, the track width would be 4.6 blades of grass wide and the bit length would be one blade of grass. As the plane flew over the grass it would count blades of grass and only miss one blade for every 25 thousand times the plane circled the Earth.
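A rough sanity check of how that analogy lines up with the 1-in-10^15 bit error rate quoted upthread; the blade length is an assumed value for illustration, not a number from the article:

```python
# How many "blades" get counted per miss, under an assumed blade length?
earth_circumference_m = 40_075_000      # metres
orbits_per_missed_blade = 25_000
blade_length_m = 0.003                  # assumption: ~3 mm per bit

bits_per_miss = orbits_per_missed_blade * earth_circumference_m / blade_length_m
print(f"1 miss per ~{bits_per_miss:.1e} bits")   # ~3e14, roughly the 10^15 ballpark
```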
jakupovic almost 2 years ago
The part about distributing loads takes me back to S3 KeyMap days and me trying to migrate to it from the initial implementation. What I learned is that even after you identify the hottest objects/partitions/buckets, you cannot simply move them and be done. Everything had to be sorted. The actual solution was to sort and then divide the host's partition load into quartiles and move the second-quartile partitions onto the least loaded hosts. If one tried to move the hottest buckets, the 1st quartile, it would put even more load on the remaining members, which would fail, over and over again.

Another side effect was that the error rate went from a steady ~1% to days without any errors. Consequently we updated the alerts to be much stricter. This was around 2009 or so.

Also came from an academic background, UM, but instead of getting my PhD I joined S3. It even rhymes :).
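A minimal sketch of the rule described above – rank a host's partitions by heat, skip the hottest quartile, and move the second quartile toward the coolest hosts. The heat metric and data structures are made up for illustration; this is not the actual KeyMap code:

```python
def second_quartile_partitions(partition_heat: dict[str, float]) -> list[str]:
    """Rank one host's partitions by heat and return only the second quartile."""
    ranked = sorted(partition_heat, key=partition_heat.get, reverse=True)
    q = len(ranked) // 4
    # Skip the hottest quartile: moving it just overloads whoever receives it.
    return ranked[q:2 * q]

def rebalance(hosts: dict[str, dict[str, float]]) -> list[tuple[str, str, str]]:
    """Plan (partition, src_host, dst_host) moves toward the least loaded hosts."""
    load = {h: sum(parts.values()) for h, parts in hosts.items()}
    src = max(load, key=load.get)            # the hottest host sheds load
    moves = []
    for part in second_quartile_partitions(hosts[src]):
        dst = min(load, key=load.get)
        if dst == src:
            break
        moves.append((part, src, dst))
        load[dst] += hosts[src][part]
        load[src] -= hosts[src][part]
    return moves
```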
mcapodici almost 2 years ago
S3 is more than storage. It is a standard. I like how you can get S3-compatible (usually with some small caveats) storage from a few places. I am not sure how open the standard is, or whether you have to pay Amazon to say you are "S3 compatible", but it is pretty cool.

Examples:

iDrive has E2, Digital Ocean has Object Storage, Cloudflare has R2, Vultr has Object Storage, Backblaze has B2
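In practice "S3 compatible" means the provider speaks the same HTTP API, so the standard SDKs work once you point them at a different endpoint. A minimal boto3 sketch (the endpoint URL and credentials are placeholders; check your provider's documentation):

```python
import boto3

# The same SDK talks to any S3-compatible provider; only the endpoint
# and credentials change. The values below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example-provider.com",
    aws_access_key_id="YOUR_KEY_ID",
    aws_secret_access_key="YOUR_SECRET",
)
s3.upload_file("report.csv", "my-bucket", "backups/report.csv")
```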
romantomjak almost 2 years ago
Apologies if this comes off as blunt, but this is the type of content I come to Hacker News for, rather than just a series of obituaries.

The author has made a lot of great points, but one that stuck with me was:

> I consciously spend a lot more time trying to develop problems, and to do a really good job of articulating them, rather than trying to pitch solutions.

I haven't thought of it in this way, but this is an excellent way of motivating someone to "own" a problem.
baq almost 2 years ago
> What's interesting here, when you look at the highest-level block diagram of S3's technical design, is the fact that AWS tends to ship its org chart. This is a phrase that's often used in a pretty disparaging way, but in this case it's absolutely fascinating.

I'd go even further: at this scale, it is essential and required to develop these kinds of projects with any sort of velocity.

Large organizations ship their communication structure by design. The alternative is engineering anarchy.
whoknowswhat11 almost 2 years ago
Over 100 million requests per second authenticated, billed, versioned, logged, checksummed, encrypted against 200+ trillion objects.
_han almost 2 years ago
The talk that this article is based on is available on YouTube: https://www.youtube.com/watch?v=sc3J4McebHE
kaycebasques almost 2 years ago
> we'd read and generally have pretty lively discussions about a collection of "classic" systems research papers

Does anyone have the list of papers?

> we managed to kind of "industrialize" verification, taking really cool, but kind of research-y techniques for program correctness, and get them into code where normal engineers who don't have PhDs in formal verification can contribute to maintaining the specification, and that we could continue to apply our tools with every single commit to the software

Is any of this open source?
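For anyone unfamiliar with the property-based testing half of that work, here is a generic sketch of the style of check, written with Python's hypothesis library; it is not S3's actual tooling, and the ToyStore class is invented for illustration:

```python
from hypothesis import given, strategies as st

class ToyStore:
    """A trivial in-memory key-value store standing in for a real engine."""
    def __init__(self):
        self._data = {}
    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value
    def get(self, key: bytes) -> bytes | None:
        return self._data.get(key)

# Property: for any sequence of writes, the last write to each key wins.
@given(st.lists(st.tuples(st.binary(min_size=1), st.binary()), min_size=1))
def test_last_write_wins(writes):
    store = ToyStore()
    for key, value in writes:
        store.put(key, value)
    expected = dict(writes)          # dict keeps the last value per key
    for key, value in expected.items():
        assert store.get(key) == value
```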
g9yuayon almost 2 years ago
S3 is a truly amazing piece of technology. It offers peace of mind (well, almost), zero operations, and practically unlimited bandwidth, at least for analytics workloads. Indeed, it's so good that there has not been much progress in building an open-source alternative to S3. There seems to be little activity in the Hadoop community. I have yet to hear of any company that uses RADOS on Ceph to handle PBs of data for analytics workloads. MinIO made its name recently, but its license is restrictive and its community is quite small compared to that of Hadoop in its heyday.
gooseyman almost 2 years ago
This is a fantastic point on ownership that those “placing” it on others can often miss.<p>“Ownership carries a lot of responsibility, but it also carries a lot of trust – because to let an individual or a team own a service, you have to give them the leeway to make their own decisions about how they are going to deliver it.”
j_not_j almost 2 years ago
> It's all one thing, and you can't really think about it just as software. It's software, hardware, and people, and it's always growing and constantly evolving.

This is a lesson a lot of software people haven't yet learned. Bad UI, bad operational experiences, insufficient logging to resolve issues, un-fixable code because it's too complicated, and so on. But they use git.

The other term of art for this concept is "system engineering", in the aerospace sense. There are a lot of good texts and courses.

One example: Wasson, System Analysis, Design, and Development, Wiley, 2005. ISBN-10 0-471-39333-9
dosman33 almost 2 years ago
Not trying to be an arse, but the guy spent a lot more time talking about himself and other unrelated stuff than about how S3 works. And I don't mind a good article on the RAMAC, but that seems... out of place in a discussion about peta-scale storage. I got the strong impression he doesn't really know the finer details of how S3 really works. And that's probably fine for what he's doing; there is plenty of room for application coding, firefighting, and problem management without having to get into the finer details of how it all works.
simonebrunozzi almost 2 years ago
From 2009, a talk I gave about S3 internals [0], when I was Technology Evangelist for AWS. Still relevant today, I believe.

[0]: https://vimeo.com/7330740
nijave almost 2 years ago
I think there's a good call-out about ownership here. Ownership and autonomy go hand in hand (you can't force someone to own something).
supermatt almost 2 years ago
How does S3 handle particularly hot objects? Is there some form of rebalancing to account for access rates?
paulddraper almost 2 years ago
> All in, S3 today is composed of hundreds of microservices

wow