How Shopify reduced storefront response times with a rewrite

350 points by vaillancourtmax almost 5 years ago

16 comments

pqdbr almost 5 years ago

Some of the listed optimizations were:

> We carefully vet what we eager-load depending on the type of request and we optimize towards reducing instances of N+1 queries.

> Reducing Memory Allocations

> Implementing Efficient Caching Layers

All of those steps seem like pretty standard ways of optimizing a Rails application. I wish the article had made it clearer why they decided to pursue such a complex route (the whole custom Lua/nginx routing and two applications instead of a monolith).

Shopify surely has tons of Rails experts and I assume they pondered a lot before going for this unusual rewrite, so of course they have their reasons, but I really didn't understand (from the article) what they accomplished here that they couldn't have done in the Rails monolith.

You don't need to ditch Rails if you just don't want to use ActiveRecord.
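
The N+1 pattern in the first quote is easy to see by counting queries. The sketch below uses Python's standard-library `sqlite3` as a stand-in for Rails/ActiveRecord; the `products`/`variants` tables and the `CountingConnection` wrapper are hypothetical. Loading variants one product at a time costs one query per product, while a single batched `IN (...)` query (roughly what ActiveRecord's `preload` does) costs two queries in total.

```python
import sqlite3
from typing import Dict, List


def make_db(n_products: int, variants_per_product: int) -> sqlite3.Connection:
    """Build an in-memory DB with a hypothetical products/variants schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE variants (id INTEGER PRIMARY KEY, product_id INTEGER, sku TEXT)"
    )
    for p in range(1, n_products + 1):
        conn.execute("INSERT INTO products (id, title) VALUES (?, ?)", (p, f"Product {p}"))
        for v in range(variants_per_product):
            conn.execute(
                "INSERT INTO variants (product_id, sku) VALUES (?, ?)", (p, f"SKU-{p}-{v}")
            )
    return conn


class CountingConnection:
    """Wraps a connection and counts every SQL statement executed through it."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.queries = 0

    def execute(self, sql: str, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)


def load_n_plus_one(db: CountingConnection) -> Dict[int, List[str]]:
    # 1 query for the products, then 1 more query per product: N + 1 in total.
    products = db.execute("SELECT id FROM products ORDER BY id").fetchall()
    return {
        pid: [sku for (sku,) in db.execute(
            "SELECT sku FROM variants WHERE product_id = ? ORDER BY id", (pid,)
        ).fetchall()]
        for (pid,) in products
    }


def load_eager(db: CountingConnection) -> Dict[int, List[str]]:
    # 1 query for the products, then 1 batched query for all of their variants.
    ids = [pid for (pid,) in db.execute("SELECT id FROM products ORDER BY id").fetchall()]
    result: Dict[int, List[str]] = {pid: [] for pid in ids}
    if not ids:
        return result
    placeholders = ",".join("?" * len(ids))
    rows = db.execute(
        f"SELECT product_id, sku FROM variants WHERE product_id IN ({placeholders}) ORDER BY id",
        ids,
    ).fetchall()
    for pid, sku in rows:
        result[pid].append(sku)
    return result


conn = make_db(n_products=50, variants_per_product=3)
lazy, eager = CountingConnection(conn), CountingConnection(conn)
assert load_n_plus_one(lazy) == load_eager(eager)
print(lazy.queries, eager.queries)  # 51 2
```

SQLite caps the number of `?` parameters per statement, so a real implementation would split very large ID lists into chunks.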

lazyant almost 5 years ago

I didn't especially care about the technical details. What I like about this article is that the first thing they mention is the success criteria of the project (hopefully defined at the very beginning, before any implementation). On top of that, they created an automated tool to verify those criteria automatically and objectively.

This is a great approach, and unfortunately I don't think many (most?) software projects start out like that. Not defining conditions of victory, and scope creep, are possibly the biggest risks in software projects.

gravypod almost 5 years ago

Shopify has traditionally been an example people point to for scaling a monolith with a large growth factor in all areas: team size, features, user base size, and the general "scale" of the company.

Does anyone here who has worked on this project, or internally at Shopify, feel that this project was successful? Do you think this is the first step in a long and gradual process where Shopify will rewrite itself into a microservice architecture? It seems like the mentality behind this project shares a lot of the commonly claimed benefits of microservices.

> Over the years, we realized that the “storefront” part of Shopify is quite different from the other parts of the monolith

Different goals that need to be solved with different architectural approaches.

> storefront requests progressively became slower to compute as we saw more storefront traffic on the platform. This performance decline led to a direct impact on our merchant storefronts’ performance, where time-to-first-byte metrics from Shopify servers slowly crept up as time went on

Noisy neighbors.

> We learned a lot during the process of rewriting this critical piece of software. The strong foundations of this new implementation make it possible to deploy it around the world, closer to buyers everywhere, to reduce network latency involved in cross-continental networking, and we continue to explore ways to make it even faster while providing the best developer experience possible to set us up for the future.

Smaller deployable units: you don't have to deploy all of Shopify at the edge, only the component that benefits from running there.

ww520 almost 5 years ago

The performance-related bits:

- Handcrafted SQL.
- Reduced memory usage, e.g. using mutable maps.
- Aggressive caching with layers of caches: DB result cache, app-level object cache, and HTTP cache. Some DB queries are partitioned, and each partitioned result is cached in a key-value store.
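
The caching layers in that list can be sketched as a read-through lookup that checks a per-process object cache, then a shared key-value store, and only then runs the query. This is a minimal Python illustration, not Shopify's implementation: `LayeredCache`, `fetch_partitioned`, and the `collection:42:<partition>` key scheme are hypothetical, a plain dict stands in for the shared store, and the HTTP cache layer and cache invalidation are left out.

```python
from typing import Any, Callable, Dict, Iterable, List


class LayeredCache:
    """Read-through lookup: per-process cache, then shared KV store, then the loader."""

    def __init__(self, shared_store: Dict[str, Any]):
        self.local: Dict[str, Any] = {}  # app-level object cache, private to this process
        self.shared = shared_store       # stand-in for a shared key-value store
        self.loads = 0                   # how many times this process ran the loader ("DB")

    def fetch(self, key: str, loader: Callable[[], Any]) -> Any:
        if key in self.local:
            return self.local[key]
        if key in self.shared:
            value = self.shared[key]
        else:
            value = loader()
            self.loads += 1
            self.shared[key] = value     # populate the shared layer for other processes
        self.local[key] = value          # and the local layer for this one
        return value


def fetch_partitioned(cache: LayeredCache, base_key: str, partitions: Iterable[int],
                      load_partition: Callable[[int], List[Any]]) -> List[Any]:
    # Each partition of the query result is cached under its own key,
    # so partitions are loaded and cached independently of each other.
    results: List[Any] = []
    for part in partitions:
        results.extend(cache.fetch(f"{base_key}:{part}", lambda p=part: load_partition(p)))
    return results


shared: Dict[str, Any] = {}
web1, web2 = LayeredCache(shared), LayeredCache(shared)  # two app processes, one shared store
load = lambda p: [f"item-{p}-{i}" for i in range(2)]     # hypothetical partition query
items = fetch_partitioned(web1, "collection:42", range(3), load)
assert fetch_partitioned(web2, "collection:42", range(3), load) == items
print(web1.loads, web2.loads)  # 3 0
```

The second process never touches the "database" because the shared layer already holds every partition.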

bdibs almost 5 years ago

I’m aware that Ruby/Rails isn’t that quick, but it seems mind-boggling that an 800 ms server response time is considered tolerated and 200 ms is satisfying. I’ve never used Ruby in production, so maybe my reference point is off and this is more impressive than I’m giving it credit for.
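
The 200 ms and 800 ms figures are consistent with the standard Apdex definition using a target threshold T = 200 ms: a response is "satisfied" if it takes at most T, "tolerating" if it takes at most 4T (here 800 ms), and "frustrated" otherwise, and the score is (satisfied + tolerating / 2) / total. A minimal Python sketch of that formula, with made-up sample latencies:

```python
from typing import Iterable


def apdex(latencies_ms: Iterable[float], t_ms: float = 200.0) -> float:
    """Apdex score: (satisfied + tolerating / 2) / total, with threshold T = t_ms."""
    satisfied = tolerating = total = 0
    for latency in latencies_ms:
        total += 1
        if latency <= t_ms:
            satisfied += 1        # at or under T
        elif latency <= 4 * t_ms:
            tolerating += 1       # over T, up to 4T
        # slower than 4T is "frustrated" and adds nothing to the numerator
    if total == 0:
        raise ValueError("apdex() needs at least one sample")
    return (satisfied + tolerating / 2) / total


samples = [90, 120, 180, 200, 250, 600, 900, 1500]  # hypothetical latencies in ms
print(apdex(samples))  # 0.625
```

Under this definition a server where every response takes exactly 800 ms still scores 0.5, which is why "tolerated" is a fairly generous label.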

tehlike almost 5 years ago

This is very interesting. N+1 and lazy loading have been a very common problem that profilers can spot, but eager loading also has a cartesian product problem: if you have an entity with 6 sub-items of one kind and 100 of another, you'll end up fetching 600 rows to construct a single object / view model.

I have recently been playing with RavenDB (from my all-time favorite engineer turned CEO). It approaches most of these as an indexing problem in the database, where the view models are calculated offline as part of the indexing pipeline. It approaches the problem from a very pragmatic angle; its goal is to be a database that is very application-centric.

It remains to be seen whether we will end up adopting it, but it'll be interesting to play with.

Disclaimer: I am a former NHibernate contributor, and have been very intimate with AR features and other pitfalls.
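
The cartesian-product effect is easy to reproduce. In this sketch (Python's built-in `sqlite3`, with hypothetical `entity`/`sub_a`/`sub_b` tables), eager-loading both collections through two JOINs in one query returns 6 × 100 = 600 rows for a single entity, while loading each collection with its own query returns 1 + 6 + 100 = 107 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY);
    CREATE TABLE sub_a (id INTEGER PRIMARY KEY, entity_id INTEGER);
    CREATE TABLE sub_b (id INTEGER PRIMARY KEY, entity_id INTEGER);
""")
conn.execute("INSERT INTO entity (id) VALUES (1)")
conn.executemany("INSERT INTO sub_a (entity_id) VALUES (?)", [(1,)] * 6)    # 6 sub-items
conn.executemany("INSERT INTO sub_b (entity_id) VALUES (?)", [(1,)] * 100)  # 100 sub-items

# One query that joins both collections: every sub_a row pairs with every sub_b row.
joined_rows = conn.execute("""
    SELECT e.id, a.id, b.id
    FROM entity e
    LEFT JOIN sub_a a ON a.entity_id = e.id
    LEFT JOIN sub_b b ON b.entity_id = e.id
    WHERE e.id = 1
""").fetchall()

# One query per collection: row counts add up instead of multiplying.
separate_rows = (
    conn.execute("SELECT id FROM entity WHERE id = 1").fetchall()
    + conn.execute("SELECT id FROM sub_a WHERE entity_id = 1").fetchall()
    + conn.execute("SELECT id FROM sub_b WHERE entity_id = 1").fetchall()
)

print(len(joined_rows), len(separate_rows))  # 600 107
```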

aloukissas almost 5 years ago

Naive question: the "storefront" piece seems like it's a static page. Why does it need SSR? Even so, it could be SSR'ed to static _once_ (kind of how NextJS does this from 9.3+), then have it served by CDN/edge. I'm probably missing something here.

kn8 almost 5 years ago

Is the new implementation still Rails?

hevelvarik almost 5 years ago

> An example of these foundations is the decision to design the new implementation on top of an active-active replication setup. As a result, the new implementation always reads from dedicated read replicas, improving performance and reducing load on the primary writers.

Could someone please explain how the ‘as a result’ follows from the active-active replication setup?
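
Independent of the replication topology, the read/write split described in the quote can be sketched as a small router. Everything here is hypothetical (the `ReplicaRouter` class, the host names, and the naive rule that any statement starting with `SELECT` is a read); real setups also have to account for replication lag when a request needs to read its own writes.

```python
import itertools
from typing import List


class ReplicaRouter:
    """Sends read-only statements to replicas (round-robin) and everything else to the primary."""

    def __init__(self, primary: str, replicas: List[str]):
        self.primary = primary
        self.replicas = list(replicas)
        self._next_replica = itertools.cycle(self.replicas)

    def route(self, sql: str) -> str:
        # Naive classification, for illustration only: SELECTs are reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self.replicas:
            return next(self._next_replica)
        return self.primary


router = ReplicaRouter("primary-db", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM products WHERE id = 1"))  # replica-1
print(router.route("select title FROM products"))          # replica-2
print(router.route("UPDATE products SET title = 'x'"))      # primary-db
```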

thejacenxpress almost 5 years ago

Unfortunately, they are still highly dependent on other APIs.

When San Diego Comic-Con went live on funko.com (Shopify), the website was fine, but checkout was bottlenecked by the API calls to shipping providers. Many people were never able to check out, and Funko had to issue an apology.

It's unfortunate that no matter how much you improve your own product, you may still be dependent on others.

momonga almost 5 years ago

I wish the article detailed the performance issues with the old implementation, and why those issues necessitated a rewrite (other than "strong primitives" and "difficult to retrofit").

spondyl almost 5 years ago

I'd be interested to know whether setting Service Level Objectives was considered as an alternative to using Apdex. It's nice to be able to calculate an error budget from your SLO and use it to determine whether changes were impacting the customer experience. Well, so the theory goes anyway. Actually doing it in practice is a whole different story ;)
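
For comparison with Apdex, the error-budget arithmetic behind an SLO is simple: with an SLO of, say, 99.9% of requests being "good" over a window, the budget is the remaining 0.1% of requests. A minimal Python sketch; the 99.9% target and the request counts are made up, and what counts as a "bad" request (for example, slower than some latency threshold) is up to whoever defines the SLO.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """How many bad requests the SLO allows in the window, e.g. 0.1% for a 99.9% SLO."""
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO was missed."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        raise ValueError("no error budget for this SLO and request count")
    return (budget - bad_requests) / budget


print(error_budget(0.999, 10_000_000))             # 10000
print(budget_remaining(0.999, 10_000_000, 2_500))  # 0.75
```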

switch11 almost 5 years ago

Can anyone add data to that article on what users saw in terms of response time and perceived response time, and what users are seeing after the improvements?

We had evaluated Shopify for one of our projects, and aesthetically it is really good. However, time-wise their store takes forever to do stuff. This was a couple of years back, so hopefully things are much better now.

Basically, the article covers how much better THE TEAM doing the coding feels. What is the effect on the users using the stores?

gadders almost 5 years ago

The bit I found interesting in this is how they compare and verify that two web pages rendered by different methods "match".

I wonder how you would do that. You can't hash the HTML. Do you take screenshots and compare?
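
One way to answer this without screenshots (not necessarily what Shopify did) is to parse both responses and compare a normalized token stream instead of raw bytes: collapse whitespace, sort attributes, and drop attributes that legitimately differ per request. A minimal sketch using Python's standard `html.parser`; the `IGNORED_ATTRS` set is a hypothetical example of per-request values such as CSP nonces.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

IGNORED_ATTRS = {"nonce", "data-request-id"}  # hypothetical per-request attributes


class _Tokenizer(HTMLParser):
    """Turns HTML into a list of normalized (kind, ...) tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens: List[Tuple] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]):
        kept = sorted(((k, v) for k, v in attrs if k not in IGNORED_ATTRS),
                      key=lambda kv: (kv[0], kv[1] or ""))
        self.tokens.append(("start", tag, tuple(kept)))

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> the same as <br>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag: str):
        self.tokens.append(("end", tag))

    def handle_data(self, data: str):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:                       # skip whitespace-only text between tags
            self.tokens.append(("text", text))


def normalize(html: str) -> List[Tuple]:
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def html_equivalent(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


legacy_html = '<div class="p" nonce="abc">\n  Hello   <b>world</b><br>\n</div>'
new_html = '<div nonce="xyz" class="p">Hello <b>world</b><br/></div>'
print(html_equivalent(legacy_html, new_html))                            # True
print(html_equivalent(legacy_html, new_html.replace("world", "there")))  # False
```

Collapsing and trimming whitespace can occasionally hide a real rendering difference (e.g. a removed space between inline elements), so this is a coarse structural check rather than a pixel-perfect one.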

notsureaboutpg almost 5 years ago

Most commenters are focused on the optimizations made, but I actually think the custom routing and verification mechanism is the interesting bit. That kind of tool could be handy in lots of scenarios (comparing the same service written in two different languages or with different dependencies, etc.).

But how does their verifier mechanism deal with changes in the production database between responses? If the legacy service responds first and the new service responds afterwards, couldn't the data in the database change between the two responses (for the same request) and cause them to fail verification when they otherwise would have passed? How do they maneuver around that issue?

Great write-up, by the way! I really liked it :)
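
One hypothetical way to handle data changing between the two renders is to re-render with the legacy service whenever the outputs disagree, and only report a mismatch if the legacy output is stable. The sketch below is not the article's mechanism; `verify`, `Verdict`, and the functions standing in for the two services and the database are made up for illustration.

```python
from enum import Enum
from typing import Callable, Dict


class Verdict(Enum):
    MATCH = "match"
    MISMATCH = "mismatch"
    INCONCLUSIVE = "inconclusive"  # the underlying data kept changing between renders


def verify(request: str,
           legacy: Callable[[str], str],
           candidate: Callable[[str], str],
           equivalent: Callable[[str, str], bool] = lambda a, b: a == b) -> Verdict:
    first_legacy = legacy(request)
    new = candidate(request)
    if equivalent(first_legacy, new):
        return Verdict.MATCH
    # The outputs disagree: render again with the legacy service to see
    # whether the data moved underneath us.
    second_legacy = legacy(request)
    if equivalent(second_legacy, new):
        return Verdict.MATCH         # data changed after the first render; fresh output agrees
    if not equivalent(first_legacy, second_legacy):
        return Verdict.INCONCLUSIVE  # legacy disagrees with itself, so don't blame the new code
    return Verdict.MISMATCH          # legacy is stable and still disagrees with the candidate


# Usage with a fake data store standing in for the database (hypothetical).
state: Dict[str, int] = {"price": 10}


def legacy_render(req: str) -> str:
    return f"price={state['price']}"


def new_render_during_write(req: str) -> str:
    state["price"] += 5  # simulate a write landing between the two renders
    return f"price={state['price']}"


print(verify("/products/1", legacy_render, new_render_during_write))  # Verdict.MATCH
```

This narrows the window but does not close it: a second write landing between the candidate render and the re-render still yields `INCONCLUSIVE`, so results are best read as rates over many requests.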

polote almost 5 years ago

tl;dr: rewrote the backend focusing on speed.

Which is good. At Reddit they would have tried to rewrite everything in ReasonML and then tried to prove at the end that it is now faster.