I'm most curious about how the experience was, both for the Atlas team and the product teams, around this quote: "Atlas is “managed,” which means that developers writing code in Atlas only need to write the interface and implementation of their endpoints. Atlas then takes care of creating a production cluster to serve these endpoints. The Atlas team owns pushing to and monitoring these clusters."

Does this imply that the Atlas team gets into the weeds of understanding the business and business logic behind these endpoints to know the scalability and throughput needs? Is the autoscaler really good enough to handle this? If it's transparent to the product team, are they aware of their usage (potentially unexpected)? I imagine the Atlas team would have to be very large to carry these sorts of responsibilities.

From a product team perspective, I imagine they are still responsible for database configuration and tuning? Has the daily auto-deployment led to unexpected breakage? Who is responsible for rollbacks? And is the product team responsible for, and capable of, hotfixes?

Maybe a broader question which all of my questions above speak to: how are the roles and responsibilities split between the Atlas team and the product engineering team that owns the code, and how has the transition to that system gone?
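For concreteness, here is roughly what I picture when the post says teams "only write the interface and implementation of their endpoints" -- a minimal, hypothetical sketch with my own invented names, not Dropbox's actual interfaces:

```python
# Hypothetical sketch of an Atlas-style "managed" endpoint: the product team writes
# only the request/response types and the handler body; per the post, the platform
# handles clustering, deploys, and monitoring. Names are my guesses, not Dropbox's API.
from dataclasses import dataclass


@dataclass
class ListFolderRequest:
    user_id: int
    path: str


@dataclass
class ListFolderResponse:
    entries: list[str]


def list_folder(request: ListFolderRequest) -> ListFolderResponse:
    """Business logic only -- no cluster sizing, deploy pipeline, or dashboards here."""
    # Stubbed data store so the example runs on its own.
    fake_store = {(42, "/docs"): ["resume.pdf", "notes.txt"]}
    return ListFolderResponse(entries=fake_store.get((request.user_id, request.path), []))


if __name__ == "__main__":
    print(list_folder(ListFolderRequest(user_id=42, path="/docs")).entries)
```

My question is essentially: who is on the hook for knowing how hot a handler like this will run once the platform wraps a cluster, an autoscaler, and a deploy pipeline around it?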
So what's the end game here: is Dropbox going to keep building out an internal Kubernetes-like platform with Atlas, or do they plan to eventually just move to k8s? I noticed this line in particular:

"We evaluated using off-the-shelf solutions to run the platform. But in order to de-risk our migration and ensure low engineering costs, it made sense for us to continue hosting services on the same deployment orchestration platform used by the rest of Dropbox."

It sounds like they acknowledge they're reinventing a lot of stuff but for now are sticking to their internal platform. Perhaps Atlas is a half-step, then, to get teams used to owning and running their code as isolated services. But everything I read that they built in Atlas--isolated orchestrated services, gRPC load balancing, canary deployments, horizontal scaling, etc.--is a bog-standard Kubernetes feature today. I'd be very leery of maintaining a bespoke Kubernetes-like platform in 2021 and beyond--in some ways it seems like it's just shifting the monolith's technical debt into an internal Atlas platform team's technical debt. What's the plan to get rid of that debt for good, I wonder?

This hurdle shows there are already some cracks in the idea of a long-term Atlas, too:

"While splitting up Metaserver had wins in production, it was infeasible to spin up 200+ Python processes in our integration testing framework. We decided to merge the processes back into a monolith for local development and testing purposes. We also built heavy integration with our Bazel rules, so that the merging happens behind the scene and developers can reference Atlasservlets as regular services."

If I read that right, does it really mean the first time a developer's code runs the way it will run in production is when it goes out to a canary deployment? I.e., integration tests are done in a local monolith instead of by setting up a mini-prod cluster. That seems a bit nerve-racking as a dev: there's no way to really test the service as a service until bits are hitting user requests. In the k8s world a ton of work has been put into tooling and processes to make setting up local clusters easy. It's a shame not to have something similar for Atlas.
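To make the testing worry concrete, here is a hedged sketch -- entirely my own invention, not Dropbox's Bazel machinery -- of what I understand "merge the processes back into a monolith for local development" to mean: the same handlers that each get their own cluster in production are simply mounted into one process for integration tests, so the network hop between them never happens locally:

```python
# Hypothetical local test harness: in production each servlet would run as its own
# orchestrated process behind gRPC; for local integration tests everything is
# registered into a single in-process router. My sketch of the idea, not Dropbox code.
from typing import Callable, Dict


class LocalMonolith:
    """Hosts many 'servlets' in one process for local integration testing."""

    def __init__(self) -> None:
        self._routes: Dict[str, Callable[[dict], dict]] = {}

    def register(self, route: str, handler: Callable[[dict], dict]) -> None:
        self._routes[route] = handler

    def call(self, route: str, request: dict) -> dict:
        # In production this would be a network call to a separately scaled cluster;
        # locally it is a plain function call, which is exactly what worries me.
        return self._routes[route](request)


def list_folder(request: dict) -> dict:
    return {"path": request["path"], "entries": ["resume.pdf", "notes.txt"]}


def get_quota(request: dict) -> dict:
    return {"user_id": request["user_id"], "bytes_free": 2_000_000_000}


if __name__ == "__main__":
    app = LocalMonolith()
    app.register("/list_folder", list_folder)
    app.register("/get_quota", get_quota)
    print(app.call("/list_folder", {"path": "/docs"}))
    print(app.call("/get_quota", {"user_id": 42}))
```

Serialization, timeouts, and per-service scaling behavior only show up once the real clusters are involved, i.e. at canary time.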
I'd be interested to better understand the timeline around this statement:

"Metaserver was stuck on a deprecated legacy framework that unsurprisingly had poor performance and caused maintenance headaches due to esoteric bugs. For example, the legacy framework only supports HTTP/1.0 while modern libraries have moved to HTTP/1.1 as the minimum version."

Dropbox has been around for a lot of years, and raised a lot of cash; was it only recently that they could pay down this technical debt? Were they really so busy in other areas that this was allowed to fester?
> Every line of code they wrote was, whether they wanted or not, shared code—they didn’t get to choose what was smart to share, and what was best to keep isolated to a single endpoint.

I know very little about Python, but does this mean that Python has no way to encapsulate code at a level larger than a class? Something like a package or a module. It does not seem like it should be necessary to break a system into separate services just to get encapsulation at a module or subsystem level.
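Partially answering my own question: as far as I know Python does have modules and packages, but encapsulation above the class level is by convention rather than enforcement. A leading underscore marks a name as internal and `__all__` controls what `from pkg import *` re-exports, yet nothing stops another team from importing the "private" helper directly. A hypothetical example (the payments package is invented):

```python
# payments/__init__.py -- a package "public interface" by convention only.
# Only charge() is meant to be shared; _retry is an internal helper.
__all__ = ["charge"]


def _retry(func, attempts=3):
    """Leading underscore signals 'internal', but Python does not enforce it:
    any caller can still write `from payments import _retry`."""
    for attempt in range(attempts):
        try:
            return func()
        except RuntimeError:
            if attempt == attempts - 1:
                raise


def charge(user_id: int, cents: int) -> str:
    """The intended public entry point for other teams."""
    return _retry(lambda: f"charged user {user_id} {cents} cents")
```

Which I read as the article's point: in a big monolith the boundary is only a convention, so every helper effectively becomes shared code whether the author wanted that or not.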
Headline of article: "Our journey from a Python monolith to a managed platform"

So... about this headline. I read this aloud to a friend at a cafe. We laughed. It makes perfect sense to us. We know what Python is. We know what a monolith means in this context.

To my other friends it was the funniest / silliest / most nonsensical thing they'd heard for a while.

IT is weird.

(ps I know no one will see this comment but I'll leave it here. Because.)
I just want to know when they'll switch to native, battery-efficient clients, especially given that the daemon is always running and monitoring file system events.
Semi-related: when people talk about monorepos, is it implied that the whole project has only one version number? Why not just version the subprojects of the monorepo? That way you have a small vetting process when cutting a release of a specific subproject, and the rest of the subprojects that depend on it can read the release notes for breaking changes, etc.
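What I have in mind is something like independent versions per subproject inside the single repo, so a release only vets the piece that changed -- a hypothetical sketch (the layout and tag scheme are invented, not any particular tool's convention):

```python
# Hypothetical monorepo where each subproject keeps its own VERSION file and
# releases are tagged per subproject (e.g. "auth/v2.3.0"). This just lists the
# candidate tags; dependent subprojects read that subproject's release notes.
from pathlib import Path


def subproject_versions(repo_root: str = ".") -> dict[str, str]:
    """Map each subproject (any directory with a VERSION file) to its version."""
    return {
        version_file.parent.name: version_file.read_text().strip()
        for version_file in Path(repo_root).glob("*/VERSION")
    }


if __name__ == "__main__":
    for name, version in sorted(subproject_versions().items()):
        print(f"{name}/v{version}")  # e.g. auth/v2.3.0, sync/v1.8.1
```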