Node.js in Flame Graphs

778 points by stoey, over 10 years ago

41 comments

ChuckMcM, over 10 years ago

The money quote: "We made incorrect assumptions about the Express.js API without digging further into its code base. As a result, our misuse of the Express.js API was the ultimate root cause of our performance issue."

This situation is my biggest challenge with software these days. The advice to "just use FooMumbleAPI!" is rampant and yet the quality of the implemented APIs and the amount of review they have had varies all over the map. Consequently any decision to use such an API seems to require that one first read and review the entire implementation of the API, otherwise you get the experience that Netflix had. Good APIs make this worse: you spend all that time reviewing them only to find they are well written, yet each new version, which may include changes committed by less clued-in people, might need another review. So you can't just leave it there. And when you find the 'bad' ones, you can send a note to the project (which can respond anywhere from "great, thanks for the review!" to "if you don't like it why not send us a pull request with what you think is a better version.")

What this means in practice is that companies that use open source extensively in their operation become slower and slower to innovate, as they are carrying the weight of a thousand different systems of checks on code quality and robustness, while people using closed source will start delivering faster and faster, as they effectively hand the review/quality question to the person selling them the software and focus on their own product innovation.

There was an interesting, if unwitting, simulation of this going on inside Google when I left, where people could check in changes to the code base that would have huge impacts across the company, causing other projects to slow to a halt (in terms of their own goals) while they ported to the new way of doing things.

In this future world, changes like the recently hotly debated systemd change will incur costs while the users of the systems stop to re-implement in the new context, and there isn't anything to prevent them from paying this cost again and again. A particularly Machiavellian proprietary source vendor might fund programmers to create disruptive changes expressly to inflict such costs on their non-customers.

I know, too tin hat, but it is what I see coming.

thedufer, over 10 years ago

> It’s unclear why Express.js chose not to use a constant time data structure like a map to store its handlers.

It's actually quite clear - most routes are defined by a regex rather than a string, so there is no built-in structure (if there's a way at all) to do O(1) lookups in the routing table. A router that only allowed string route definitions would be faster but far less useful.

I can't explain away the recursion, though. That seems wholly unnecessary.

Edit: Actually, I figured that out, too. You can put middleware in a router so it only runs on certain URL patterns. The only difference between a normal route handler and a middleware function is that a middleware function uses the third argument (an optional callback) and calls it when done to allow the route matcher to continue through the routes array. This can be asynchronous (thus the callback), so the router has to recurse through the routes array instead of looping.
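
The mechanism described in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not Express's actual implementation: `createRouter`, `use`, `handle` and the two-argument handler signature are all hypothetical names chosen for the example. It shows a linear scan over regex-defined routes in which each handler decides when (possibly later, asynchronously) the scan resumes.

```javascript
// Hypothetical mini-router; NOT Express's real code, only the general idea.
function createRouter() {
  const stack = []; // ordered list of { pattern, handler }

  function use(pattern, handler) {
    stack.push({ pattern, handler });
  }

  function handle(req, done) {
    let index = 0;
    // next() resumes the linear scan where it left off. A handler may call it
    // later (asynchronously), which is why a plain for-loop does not fit.
    function next() {
      while (index < stack.length) {
        const layer = stack[index++];
        if (layer.pattern.test(req.url)) {
          return layer.handler(req, next);
        }
      }
      done(); // nothing (else) matched
    }
    next();
  }

  return { use, handle };
}

const router = createRouter();
const log = [];

// Middleware: matches a URL prefix, does some work, then continues later.
router.use(/^\/users\//, (req, next) => {
  log.push('auth');
  setImmediate(next);
});

// Route handler: matches the full URL and ends the chain by not calling next.
router.use(/^\/users\/\d+$/, (req, next) => {
  log.push('show user');
});

router.handle({ url: '/users/42' }, () => log.push('not found'));
```

Right after `router.handle` returns, only the middleware has run; the route handler runs once the deferred `next` fires. Every pattern ahead of the matching one is still tested in order, which is the linear cost the thread is discussing.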

rwaldin, over 10 years ago

I'm surprised nobody has mentioned that express has a built-in mechanism for sublinear matching against the entire list of application routes. All you have to do is nest Routers (http://expressjs.com/4x/api.html#router) based on URL path steps and you will reduce the overall complexity of matching a particular route from O(n) to near O(log n).
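
A rough way to see the effect of nesting, without depending on Express itself, is to count pattern tests in a flat list versus a two-level grouping by the first path step. Everything below (`matchFlat`, `matchNested`, the route names) is made up for illustration; it only shows that grouping skips whole sub-lists and does not prove any particular complexity bound.

```javascript
// Hypothetical comparison of flat vs. nested route matching; not Express code.
function matchFlat(routes, url) {
  let tests = 0;
  for (const route of routes) {
    tests++;
    if (route.pattern.test(url)) return { name: route.name, tests };
  }
  return { name: null, tests };
}

function matchNested(groups, url) {
  let tests = 0;
  for (const group of groups) {
    tests++;
    if (!group.prefix.test(url)) continue; // skip this whole sub-router
    for (const route of group.routes) {
      tests++;
      if (route.pattern.test(url)) return { name: route.name, tests };
    }
  }
  return { name: null, tests };
}

// Three resources with three actions each: nine routes in total.
const resources = ['users', 'projects', 'todos'];
const actions = ['list', 'show', 'edit'];

const groups = resources.map((r) => ({
  prefix: new RegExp(`^/${r}/`),
  routes: actions.map((a) => ({
    name: `${r}:${a}`,
    pattern: new RegExp(`^/${r}/${a}$`),
  })),
}));
const flatRoutes = groups.flatMap((g) => g.routes);

const flat = matchFlat(flatRoutes, '/todos/edit');
const nested = matchNested(groups, '/todos/edit');
console.log(flat.tests, nested.tests); // 9 6
```

The worst-case route needs nine tests in the flat list and six with one level of grouping; deeper nesting on further path steps would reduce the count further.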

remon, over 10 years ago

I wonder what the thought process was behind moving their web service stack (partially?) to node.js in the first place. For a company with the scale and resources of Netflix it's not exactly an obvious choice.

elwell, over 10 years ago

TIL, SVGs can display labels on element hover: http://cdn.nflximg.com/ffe/siteui/blog/yunong/200mins.svg

Nice, contained way to show data like this.

vkjv, over 10 years ago

> ...as well as increasing the Node.js heap size to 32Gb.

> ...also saw that the process’s heap size stayed fairly constant at around 1.2 Gb.

This is because 1.2 GB is the max allowed heap size in v8. Increasing beyond this value has no effect.

> ...It’s unclear why Express.js chose not to use a constant time data structure like a map to store its handlers.

It is non-trivial (not possible?) to do this in O(1) for routes that use matching / wildcards, etc. This optimization would only be possible for simple routes.

tjholowaychuk, over 10 years ago

Sounds like a documentation issue, or lack of a staging environment. I've written and maintained countless large Express applications and routing was never even remotely a bottleneck, thus the simple & flexible linear lookup. I believe we had an issue or two open for quite a while in case anyone wanted to report real use-cases that performed poorly.

Possibly worth mentioning, but there's really nothing stopping people from adding dtrace support to Express; it could easily be done with middleware. Switching frameworks seems a little heavy-handed for something that could have been a 20 minute npm module.

_Marak_, over 10 years ago

I read:

"This turned out be caused by a periodic (10/hour) function in our code. The main purpose of this was to refresh our route handlers from an external source. This was implemented by deleting old handlers and adding new ones to the array"

> refresh our route handlers from an external source

This is not something that should be done in a live process. If you are updating the state of the node, you should be creating a new node and killing the old one.

Aside from hitting a somewhat obvious behavior for messing with the state of express in a running process, once you have introduced the idea of programmatically putting state into your running node you have seriously impeded the ability to create a stateless, fault-tolerant distributed system.

TheLoneWolfling, over 10 years ago

> benchmarking revealed merely iterating through each of these handler instances cost about 1 ms of CPU time

1ms / entry? What is it doing that it's spending 3 million cycles on a single path check?
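
The "3 million cycles" figure follows from simple arithmetic if one assumes a core clocked at about 3 GHz. That clock speed is an assumption implied by the comment, not a number stated in the article.

```javascript
// Assumption: a ~3 GHz core, which is what "3 million cycles per ms" implies.
const clockHz = 3e9;   // cycles per second (assumed)
const durationMs = 1;  // the 1 ms quoted from the article

const cycles = (clockHz / 1000) * durationMs;
console.log(cycles); // 3000000
```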

clebio, over 10 years ago

> I can’t imagine how we would have solved this problem without being able to sample Node.js stacks and visualize them with flame graphs.

This has me scratching my head. The diagrams are pretty, maybe, but I can't read the process calls from them (the words are truncated because the graphs are too narrow). And I can't see, visually, which calls are repeated. They're stacked, not grouped, and the color palette is quite narrow (color brewer might help here?).

At least, I _can_ imagine how you could characterize this problem without novel eye-candy. Use histograms. Count repeated calls to each method and sort descending. Sampling is only necessary if you've got -- really, truly, got -- big data (which Netflix probably does), but I don't think the author means 'sample' in a statistical sense. It sounds more like 'instrumentation', decorating the function calls to produce additional debugging information. Either way, once you have that, there are various common ways to isolate performance bottlenecks, few of which probably require visual graphs.

There's also various lesser inefficiencies in the flame graphs: is it useful (non-obvious) that every call is a child of `node`, `node::Start`, `uv_run`, etc.? Vertical real-estate might be put to better use with a log-scale? Etcetera, etc.
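
The histogram approach suggested here could look something like the sketch below. It counts how many captured stacks each function appears in and sorts descending. The sample data and the `frameHistogram` name are invented for illustration; real input would come from whatever profiler or instrumentation produced the stacks.

```javascript
// Made-up stack samples, outermost frame first. Real data would come from a
// profiler; the frame names only echo ones mentioned in the thread.
const samples = [
  ['node::Start', 'uv_run', 'router.handle', 'next', 'next', 'layer.match'],
  ['node::Start', 'uv_run', 'router.handle', 'next', 'layer.match'],
  ['node::Start', 'uv_run', 'gc'],
];

// Count the number of samples each function appears in, sorted descending.
function frameHistogram(stacks) {
  const counts = new Map();
  for (const stack of stacks) {
    // A Set counts a function once per sample even when it recurses.
    for (const frame of new Set(stack)) {
      counts.set(frame, (counts.get(frame) || 0) + 1);
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const histogram = frameHistogram(samples);
for (const [frame, count] of histogram) {
  console.log(`${count}\t${frame}`);
}
```

A flat count like this also illustrates the point about root frames: `node::Start` and `uv_run` top the list simply because they are in every sample, so some filtering of common ancestors is needed before the list is useful.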

drderidder, over 10 years ago

> our misuse of the Express.js API was the ultimate root cause of our performance issue

That's unfortunate. Restify is a nice framework too, but mistakes can be made with any of them. StrongLoop has a post comparing Express, Restify, hapi and LoopBack for building REST APIs, for anyone interested: http://strongloop.com/strongblog/compare-express-restify-hapi-loopback/

wpietri, over 10 years ago

From the article:

> What did we learn from this harrowing experience? First, we need to fully understand our dependencies before putting them into production.

Is that the lesson to learn? That scares me, because a) it's impossible, and b) it lengthens the feedback loop, decreasing systemic ability to learn.

The lesson I'd learn from that would be something like "Roll new code out gradually and heavily monitor changes in the performance envelope."

Basically, I think the approach of trying to maximize mean time between failures is self-limiting, because failure is how you learn. I think the right way forward for software is to focus on reducing incident impact and mean time to recovery.

ecaron, over 10 years ago

My biggest takeaway from this article is that Netflix is moving from Express to Restify, and I look forward to watching the massive uptick this has on https://github.com/mcavage/node-restify/graphs/contributors

forrestthewoods, over 10 years ago

If I had to pick one line to highlight (not to criticize, but because it is a wise lesson worth sharing) it would be this one:

"First, we need to fully understand our dependencies before putting them into production."

Fishrock123, over 10 years ago

I would like to mention that Netflix could have consulted the express maintainers (us) but didn't.

Source: myself - https://github.com/strongloop/express/pull/2237#issuecomment-59681175

augustl, over 10 years ago

A surprising number of path recognizers are O(n). Paths/routes are a great fit for radix trees, since there are typically repetitions, like /projects, /projects/1, and /projects/1/todos. The performance is O(log n).

I built one for Java: https://github.com/augustl/path-travel-agent
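
To make the idea concrete, here is a small sketch of a trie keyed by path segment, which is a simpler relative of a radix tree and not the data structure used by the linked Java library. `RouteTrie`, `add`, `find` and the `:param` syntax are assumptions made for the example. Lookup work grows with the number of segments in the requested path rather than with the number of registered routes.

```javascript
// Hypothetical segment trie; a simplified stand-in for a radix tree router.
function makeNode() {
  return { children: new Map(), paramChild: null, paramName: null, handler: null };
}

class RouteTrie {
  constructor() {
    this.root = makeNode();
  }

  // Register a path such as '/projects/:id/todos'.
  add(path, handler) {
    let node = this.root;
    for (const segment of path.split('/').filter(Boolean)) {
      if (segment.startsWith(':')) {
        if (!node.paramChild) {
          node.paramChild = makeNode();
          node.paramChild.paramName = segment.slice(1);
        }
        node = node.paramChild;
      } else {
        if (!node.children.has(segment)) node.children.set(segment, makeNode());
        node = node.children.get(segment);
      }
    }
    node.handler = handler;
  }

  // Walk one node per path segment; static segments win over parameters.
  // There is no backtracking, so this is deliberately less general than regexes.
  find(path) {
    let node = this.root;
    const params = {};
    for (const segment of path.split('/').filter(Boolean)) {
      if (node.children.has(segment)) {
        node = node.children.get(segment);
      } else if (node.paramChild) {
        node = node.paramChild;
        params[node.paramName] = segment;
      } else {
        return null;
      }
    }
    return node.handler ? { handler: node.handler, params } : null;
  }
}

const trie = new RouteTrie();
trie.add('/projects', 'listProjects');
trie.add('/projects/:id', 'showProject');
trie.add('/projects/:id/todos', 'listTodos');

const hit = trie.find('/projects/1/todos');
console.log(hit); // { handler: 'listTodos', params: { id: '1' } }
```

The trade-off raised elsewhere in the thread still applies: a structure like this handles literal segments and simple parameters, but not arbitrary regex routes.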

degobah, over 10 years ago

tl;dr:

* Netflix had a bug in their code.
* But Express.js should throw an error when multiple route handlers are given identical paths.
* Also, Express.js should use a different data structure to store route handlers. EDIT: HN commenters disagree.
* node.js CPU Flame Graphs (http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html) are awesome!

bcoates, over 10 years ago

It's not just the extra lookups -- static in express is deceptively dog-slow. For every request it processes, it stats every filename that might satisfy the URL. This results in an enormous amount of useless syscall/IO overhead. This bit me pretty hard on a high-throughput webservice endpoint with an unnoticed extra static middleware. I wound up catching it with the excellent NodeTime service.

Now that I look at it, there's a TOCTOU bug on the fstat/open callback, too: https://github.com/tj/send/blob/master/index.js#L570-L605

This should be doing open-then-fstat, not stat-then-open.
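
The open-then-fstat ordering can be sketched with Node's `fs` module as below. This is an illustration of the general pattern, not a patch for the linked `send` code; `openThenStat` is a hypothetical helper, and the synchronous calls are used only to keep the example short (a server would normally use the asynchronous variants).

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Open first, then fstat the descriptor. Both calls then refer to the same
// underlying file, even if the path is replaced in between, which closes the
// time-of-check/time-of-use gap that stat-then-open leaves open.
function openThenStat(filePath) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const stats = fs.fstatSync(fd);
    if (!stats.isFile()) throw new Error('not a regular file');
    return { fd, size: stats.size };
  } catch (err) {
    fs.closeSync(fd); // do not leak the descriptor on failure
    throw err;
  }
}

// Demo on a throwaway file in the system temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'toctou-'));
const file = path.join(dir, 'hello.txt');
fs.writeFileSync(file, 'hello');

const opened = openThenStat(file);
const body = Buffer.alloc(opened.size);
fs.readSync(opened.fd, body, 0, opened.size, 0);
fs.closeSync(opened.fd);

// Clean up the temporary file and directory.
fs.unlinkSync(file);
fs.rmdirSync(dir);

console.log(body.toString()); // hello
```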

jaytaylor, over 10 years ago

I am upset that the title has been changed from "Node.js in Flames", which is not only the real title of the article, but also a reasonable description of what they've been facing with Node.

#moderationfail

ajsharma, over 10 years ago

This is the first I've heard of restify, but it seems like a useful framework for the main focus of most Node developers I know, which is to replace an API rather than a web application.

codelucas, over 10 years ago

> This turned out be caused by a periodic (10/hour) function in our code. The main purpose of this was to refresh our route handlers from an external source. This was implemented by deleting old handlers and adding new ones to the array. Unfortunately, it was also inadvertently adding a static route handler with the same path each time it ran.

I don't understand the need for refreshing route handlers. Could someone explain why they needed to do this, and also why from an external source?
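
The failure mode in the quoted passage can be reproduced in miniature. The sketch below is not Netflix's code and does not use Express: `refresh`, the `dynamic` flag, the `/static` path and the `guard` option are all invented for illustration. It only shows how a periodic refresh that re-adds one handler each time grows the handler array, and how a simple existence check avoids that.

```javascript
// Hypothetical stand-in for a router's ordered handler array.
function refresh(handlers, dynamicPaths, { guard }) {
  // Drop the old dynamic handlers and keep everything else.
  const kept = handlers.filter((h) => !h.dynamic);
  handlers.length = 0;
  handlers.push(...kept);

  for (const p of dynamicPaths) {
    handlers.push({ path: p, dynamic: true });
  }

  // The bug being illustrated: the static handler is re-added on every
  // refresh unless we first check whether it is already registered.
  const hasStatic = handlers.some((h) => h.path === '/static' && !h.dynamic);
  if (!guard || !hasStatic) {
    handlers.push({ path: '/static', dynamic: false });
  }
}

const buggy = [];
const guarded = [];
const refreshesPerDay = 10 * 24; // 10 per hour, as in the article

for (let i = 0; i < refreshesPerDay; i++) {
  refresh(buggy, ['/a', '/b'], { guard: false });
  refresh(guarded, ['/a', '/b'], { guard: true });
}

console.log(buggy.length, guarded.length); // 242 3
```

After one simulated day the unguarded array holds 240 copies of the static handler, each of which a linear router would have to step through on every request.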

exratione, over 10 years ago

The express router array is pretty easy to abuse, it's true. For example, as something you probably shouldn't ever do:

https://www.exratione.com/2013/03/nodejs-abusing-express-3-to-enable-late-addition-of-middleware/

I guess the Netflix situation is one of those that doesn't occur in most common usage; certainly dynamically updating the routes in live processes versus just redeploying the process containers hadn't occurred to me as a way to go.

hardwaresofton, over 10 years ago

Responses are already firing in: https://news.ycombinator.com/item?id=8632220

pm90, over 10 years ago

I love these kinds of investigations into problems in production. I mean, you really have to admire their determination in getting to the root of the problem.

In some ways, these engineers are not that different from academic researchers, in that they are devising experiments and verifying techniques, all in pursuit of the question: why?

hit8run, over 10 years ago

I would have written my APIs in golang and not nodejs. Go is way faster in my experience and it feels leaner to create something, because creating a web service can be done productively out of the box. Node apps tend to depend on thousands of 3rd party dependencies, which makes the whole thing feel fragile to me.

MichaelGG, over 10 years ago

Would someone explain what I'm missing about the flame graphs? Why are they indispensable here? In a normal profiler, you'd just expand the hot path and see what had the most samples. Apart from making recursion very explicit, what special aspect do flame graphs expose?

BradRuderman, over 10 years ago

Why are they loading in routes from an external source? Is that normal? I have never seen that before.

bentcorner, over 10 years ago

Interesting article. I have a lot of experience dealing with ETLs in WPA on the Windows side - it's an awesome tool that gives you similar insights. I haven't used it for looking at javascript stacks before though, so I don't know if it'll do that.

sysk, over 10 years ago

> We also saw that the process’s heap size stayed fairly constant at around 1.2 Gb.

> Something was adding the same Express.js provided static route handler 10 times an hour.

Why didn't it increase the heap size? Maybe it was too small to be noticeable?

pcl, over 10 years ago

> Second, given a performance problem, observability is of the utmost importance

I couldn't agree with this more. Understanding where time is being spent and where pools etc. are being consumed is critical in these sorts of exercises.

dmitrygr, over 10 years ago

So the lesson is to actually know the code you deploy to prod? Is that not obvious?

drinchev, over 10 years ago

The NodeJS project already has a similar issue about recursive route matching: https://github.com/strongloop/express/issues/2412

debacle, over 10 years ago

Doesn't this seem like a bug in the express router? All of the additional routes in the array are dead (can't be routed to).

Pharohbot, over 10 years ago

I wonder how Netflix would perform with using Dart with the DartVM. I reckon it would be faster than Node based on benchmarks I've seen. Chrome DartVM support is right around the corner ;)

revelation, over 10 years ago

Crazy talk. In 1ms, I can perspective transform a moderately big image. NodeJS can't iterate through a list.

We really need a 60 fps equivalent for web stuff. You have 16ms, that's it.

coldcode, over 10 years ago

I must admit I could enjoy just doing this type of analysis all day long. Yet I hate non computing puzzles.

qodeninja, over 10 years ago

wow. I love that Netflix is using Node and am even more curious that they would use express.

notastartup, over 10 years ago

this is why you stick to tried and true methods folks. this is such a typical node.js fanboy mentality. "reinventing the wheels is justified because asynchronous". or "i want this trendy way to do things just because everyone else is jumping on the bandwagon".

Give me flask + uwsgi + nginx any day.

talkingtab, over 10 years ago

an unfortunate title. Ha ha "flames" ha ha "Node.js" but the article is really about express. Not so "ha ha"

general_failure, over 10 years ago

A very good reason to go with express is TJ. He was the initial author of express and he is quite brilliant when it comes to code quality. Of course, TJ is no longer part of the community, but his legacy lives on :-)

gadders, over 10 years ago

OFFTOPIC: "Today, I want to share some recent learnings from performance tuning this new application stack."

The word you want is "lessons".
