Ex-Amazon engineer of several years here.

This is a pretty interesting article, but it's important to know that Amazon's internal tooling changes pretty fast, even if it's mostly several years behind the state of the art.

Exhibit A: Apollo

Apollo used to be *insane*. It was designed for the use case of deploying changes to thousands of C++ CGI servers across thousands of website hosts, worrying about compiling for different architectures, supporting special fleets with overrides to certain shared libraries, and so on. It had an entire glossary of strange terms which you needed to know in order to operate it. Deployments to our global fleet involved clicking through tens of pages, copy-and-pasting info from page to page, duplicating actions left, right and centre, and hoping that you hadn't forgotten something.

When I left, most of that had been swept away and replaced with a continuous deployment tool. Do a bit of setup, commit your code to the internal Git repo, watch it get picked up, automated tests run, and deployments get created for each fleet. Monitoring tools automatically rolled back deploys if certain key metrics changed (I've sketched the rough shape of that below).

Auto scaling became a reality too, once the Move to AWS project completed. You still needed budgetary approval to raise your maximum number of servers (because for our team you were talking thousands of servers per region!) but you could keep them in reserve and only deploy them as needed.

Manually copying Apollo config for environment setup was still kind of a thing, though. The ideas behind CloudFormation hadn't quite filtered down yet.

Exhibit B: logs

My memory's a bit hazy on this one. There certainly was a lot of centralized logging and monitoring infrastructure. I'm pretty sure logs got pulled into a central, searchable repository after they'd existed on the hosts for a short while. But yes, for realtime viewing you'd definitely be looking at a tool that opened a bunch of terminals for you (also sketched below).

The monitoring tools got a huge revamp about halfway through my tenure, gaining interactive dashboarding and metrics drill-down features which were invaluable when on call. I'm currently implementing a monitoring system, so my appreciation for just how well that system worked is pretty high!

Exhibit C: service discovery

Amusingly, a centralized service discovery tool was one of those tools that used to exist but had fallen into disrepair by the time this person was working there.

This was a common pattern in Amazon. Contrary to the 'Amazon doesn't experiment' conclusion, Amazon if anything tended to experiment too much - the Next Big Thing was constantly being released in beta, adopted by a small number of early adopters, and then disappearing for lack of funding/maintenance/headcount.

I can't think of any time I hard-wired load balancer host names, though. Usually they would be set up in DNS. We did use to have some custom tooling to discover our webserver hosts and automatically add/remove them from load balancers, but that was made obsolete by the auto-scaling / continuous deployment system years before I left.

As for the question of "can we shut this down? who uses it?" - ha, yes, I seem to remember having that issue. I think that, before my time, it wasn't really a problem: to call a service you needed to consume its client library, so you could just look in the package manager to see which services declared that as a dependency. With the move to HTTP services, that got lost.
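To make the Exhibit A point about metric-gated rollbacks concrete, here's the rough shape of the idea expressed in public-AWS terms rather than the internal tooling. The alarm names and deployment ID below are made up, and the internal system wasn't built on CloudWatch/CodeDeploy - this is only a sketch of the behaviour.

    import time
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    codedeploy = boto3.client("codedeploy")

    # Hypothetical names - substitute whatever your pipeline actually watches.
    ALARM_NAMES = ["orders-5xx-rate", "orders-p99-latency"]
    DEPLOYMENT_ID = "d-EXAMPLE123"

    def key_metrics_unhealthy():
        # The deployment is considered bad if any watched alarm is firing.
        resp = cloudwatch.describe_alarms(AlarmNames=ALARM_NAMES)
        return any(a["StateValue"] == "ALARM" for a in resp["MetricAlarms"])

    # Poll while the new code bakes; roll back as soon as a key metric trips.
    for _ in range(60):  # roughly 30 minutes of bake time
        if key_metrics_unhealthy():
            codedeploy.stop_deployment(deploymentId=DEPLOYMENT_ID,
                                       autoRollbackEnabled=True)
            break
        time.sleep(30)

The real thing was wired into the deployment pipeline rather than being a polling script, but the "watch the graphs and pull the cord automatically" part is what made on-call life so much better.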
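The 'bunch of terminals' style of realtime log viewing under Exhibit B boiled down to fanning out ssh sessions across the fleet, something like the sketch below. The hostnames and log path are invented; the real tool discovered the hosts and managed the terminals for you.

    import subprocess
    import threading

    # Invented hostnames and path, purely for illustration.
    HOSTS = ["webserver-1a.example.internal", "webserver-1b.example.internal"]
    LOG_PATH = "/var/log/service/application.log"

    def tail(host):
        # One ssh session per host, streaming the log back with a host prefix.
        proc = subprocess.Popen(["ssh", host, "tail -F " + LOG_PATH],
                                stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            print("[%s] %s" % (host, line), end="")

    threads = [threading.Thread(target=tail, args=(host,), daemon=True)
               for host in HOSTS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()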
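And on the "who uses this service?" question: in the client-library days, that check was effectively a reverse-dependency query against package metadata. A toy version, with an invented manifest format (the real package manager exposed the same information, just not like this):

    import json
    from pathlib import Path

    def consumers_of(client_library, manifest_dir="package-manifests"):
        # Return every package that declares the service's client library as a
        # dependency - i.e. everyone you'd have to chase before a shutdown.
        consumers = []
        for manifest_path in Path(manifest_dir).glob("*/manifest.json"):
            manifest = json.loads(manifest_path.read_text())
            if client_library in manifest.get("dependencies", []):
                consumers.append(manifest["name"])
        return consumers

    # e.g. consumers_of("OrderServiceJavaClient")

Once callers were just making HTTP requests, there was no equivalent artifact to query, hence the pain.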
That lack of visibility was somewhat mitigated over the years by services moving to a fully authenticated model, with client services needing to register for access tokens to call their dependencies, but it was still a work in progress a few years ago.

Exhibit D: containers

Almost everything in Amazon ran on a one-host-per-service model, with the packages present on a host dictated by Apollo's dependency resolution mechanism, so containers weren't needed to isolate multiple programs' dependencies on the same host.

Screwups caused by different system binaries and libraries on different generations of host were a thing, though, and were particularly unpleasant to diagnose. Again, that mostly went away once AWS was a thing and we didn't need to hold onto our hard-won bare-metal servers.

'Amazon Does Not Experiment'

Amazon doesn't really do open source very well. The company is dominated by *extremely* twitchy lawyers. For instance, my original employment contract stated that I could not talk about any of the technology I used at my job - including which programming languages I used! Unsurprisingly, nobody paid attention to that. It meant that, for many years, the company gladly consumed open source, but any question of contributing back was practically off the table, as it might have risked exposing which open source projects were used internally.

A small group of very motivated engineers, backed by a lot of open-source-friendly employees, gradually changed that over the years. My first ever Amazon open source contribution took over a year to be approved. The ones I made after that were more on the order of a week.

Other companies might regard open sourcing entire projects as good PR, but Amazon doesn't particularly seem to see it that way, so it's not given much in the way of funding or headcount. AWS is the obvious exception, but that's because AWS's open source libraries help people spend more money on AWS.

Instead, engineers within Amazon are pushed to generate ideas and either patent them or turn them into AWS services. The latter is good PR *and* money.

As for different languages: it really depends on the team. I know a team that happily experimented with languages, including functional programming. But part of the reason for the pushback is that a) Amazon has incredibly high engineer turnover, both due to expansion and due to burnout, so you need to choose a language that new engineers can learn in a hurry, and b) you need to be prepared for your project to be taken over by another team, so it had better be written in something simple. You need a very good justification if you want to choose something non-standard.

Overall, Amazon is a pretty weird place to work as an engineer.

I would definitely not recommend it to anybody whose primary motivation is to work on the newest, shiniest technologies and tooling!

On the other hand, the opportunities within Amazon to work at massive scale are pretty great.

One of the 'fun' consequences of Amazon's massive scale is the "we have special problems" issue. At Amazon's scale, things genuinely start breaking in weird ways. For instance, Amazon pushed so much traffic through its internal load balancers that it started running into scaling limits in the load balancer software itself, to the point where eventually they gave up and began developing their own load balancers!
Similarly, source control systems and documentation repositories kept being introduced, becoming overloaded, and then being replaced with something more performant.

But the problem is that "we have special problems" starts to become the default assumption, and Not Invented Here starts to creep in. Teams either don't bother searching for external software that can do what they need, or dismiss suggestions with "yeah, that won't work at Amazon scale". And because Amazon is so huge, there isn't even much weight given to figuring out how other Amazon teams have solved the same problem.

So you end up with each team reinventing its own particular wheel, hundreds of engineer-hours logged building, debugging and maintaining that wheel, and burned-out engineers leaving after spending several years in a software parallel universe, with no knowledge of the current industry state of the art.

I'm one of them. I'm just teaching myself Docker at the moment. It's pretty great.