
We 30x'd our Node parallelism

202 points by bjacokes, over 5 years ago

24 comments

kevstev, over 5 years ago
I was building scalable node applications a few years ago for a very large e-commerce player - millions of customers. I think node.js is a great platform, but its apparent simplicity means there are hordes, and I mean like 90+% of the community, that can "just get things done" without understanding what is going on under the hood at all. And to be fair, for most startupy types of companies that need to iterate fast, that is what you want to optimize for.

My interview screening question was pretty simple: "Is node.js single threaded or multithreaded?" Most would spit back the blogspam headline: "Single threaded!" I think the most correct answer is "it's complicated", but I would accept that, because most people would say it is the "right" answer. So I would follow up with: "What exactly happens in a default installation if we have, say, 5 requests come in at exactly the same time to just return some static content from disk?" (Node's default threadpool is 4.) And here is where you could see their understanding fall apart. Some would say the requests would be handled entirely synchronously, others completely in parallel - but then they had no idea what the cause of the parallelism was. Very few actually understood that node is an event loop executing javascript, backed by a threadpool for async operations.

Before reading this post, I was thinking it was a waste of time - typical Medium bullshit - they almost certainly found they were doing some blocking call in the event loop, removed it, and voila, 30x speedup. It was interesting because it was a lot worse! They spent all this time and hard work figuring out everything except what was taking so long in the event loop, and it seems that was the last place they actually looked.

Anyway, node can be a highly scalable platform (https://changelog.com/podcast/116), but you need to understand it or else it will come back to bite you. When I was last doing this stuff, upwards of 80% of our time was being spent essentially just JSON.parse()'ing, and we were looking to move to protobufs to avoid that.
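A minimal sketch (not from the comment) of the threadpool behaviour the screening question probes. It uses crypto.pbkdf2 rather than disk reads because each call occupies a libuv pool thread long enough to make the queueing visible: with the default UV_THREADPOOL_SIZE of 4, four tasks finish at roughly the same time and the fifth takes about twice as long, while every callback still runs one at a time on the single event-loop thread.

    // threadpool-demo.js -- run with: node threadpool-demo.js
    const crypto = require('crypto');

    const start = Date.now();
    for (let i = 1; i <= 5; i++) {
      // pbkdf2 is dispatched to the libuv threadpool (default size 4)
      crypto.pbkdf2('secret', 'salt', 500000, 64, 'sha512', () => {
        console.log(`task ${i} finished after ${Date.now() - start}ms`);
      });
    }
    // Typical output: tasks 1-4 complete together, task 5 roughly twice as late.
    // Running with UV_THREADPOOL_SIZE=5 lets all five complete together.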
spamizbad, over 5 years ago
The only way this makes sense to me is if they have to contend with lots of expensive parsing, event sequencing, and throttling requirements. Payment APIs, bank websites, etc. can be quite byzantine. I could understand how one might code themselves into a corner with a monolithic node app and basically just say "F it, we're doing this synchronously!"

I don't even think it's a terribly *bad* thing to do, assuming it favors feature velocity... but at that point, I'd recommend moving away from Node towards something like Python. And if you wanted to dip your toes back into async plumbing land, explore Go or Elixir.
rauchp, over 5 years ago
That was an interesting read, thanks for linking to it. It's hard to find articles online discussing Node and performance; most people just dismiss it as an unviable option due to scale and speed concerns. 30x really is quite the jump, though.

> Each Node worker runs a gRPC server

Not going to lie, this kind of surprised me. When I think of a Node backend I think of ExpressJS. Not because I think Express is better, but because it's been pushed around in the past few years as the fastest, simplest way of running a backend.

Yet if you're going to be running a gRPC server, why not use a more performant language with better multithreading support? I thought this article was about them optimizing a grandfathered-in solution (such as Express), but I can't tell why they built out a gRPC server in Node in the first place.
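For readers who have not seen gRPC in Node, a worker like the one quoted above might be wired up roughly as sketched below. This is an illustration only, assuming the @grpc/grpc-js and @grpc/proto-loader packages; the bank_integration.proto file, package name, and PullTransactions method are hypothetical and not taken from Plaid's post.

    const grpc = require('@grpc/grpc-js');
    const protoLoader = require('@grpc/proto-loader');

    // Hypothetical proto defining: service Integrations { rpc PullTransactions(...) returns (...); }
    const definition = protoLoader.loadSync('bank_integration.proto');
    const proto = grpc.loadPackageDefinition(definition).bankintegration;

    const server = new grpc.Server();
    server.addService(proto.Integrations.service, {
      PullTransactions(call, callback) {
        // The bank scraping / third-party API calls would run asynchronously here,
        // keeping the event loop free while requests are in flight.
        callback(null, { transactions: [] });
      },
    });

    server.bindAsync('0.0.0.0:50051', grpc.ServerCredentials.createInsecure(), (err) => {
      if (err) throw err;
      server.start();
    });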
7777fps, over 5 years ago
> We were running 4,000 Node containers (or "workers") for our bank integration service. The service was originally designed such that each worker would process only a single request at a time. This design lessened the impact of integrations that accidentally blocked the event loop, and allowed us to ignore the variability in resource usage across different integrations. But since our total capacity was capped at 4,000 concurrent requests, the system did not gracefully scale.

I can't be the only person who reads stories like this and wonders how they arrived at that solution in the first place?

They failed to scale because their previous approach to scaling was a worker per request - a model the industry roundly moved away from because that's how CGI and Apache modules worked, and it didn't scale well.

I thought one of the key selling points of Node was a fully async standard library, enabling better scaling in process.

But then you read stories like this, and I find it hard to relate to the original problem.
mnutt, over 5 years ago
I’d be curious to hear more about the circumstances that ended up with a blocked runloop. Are there hundreds of junior engineers, or perhaps third parties writing code that you don’t control? I have seen people accidentally write blocking code, but not at such an egregious rate that we couldn’t catch it in code review, or at worst the runloop detector would alert on it in prod and we would roll back the deploy.

For instances where you actually know you need lots of CPU, there are now strategies for offloading that specific work, although they have taken a while to get nice and easy to use.
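The "runloop detector" mentioned here can be as simple as sampling the event-loop delay histogram Node exposes through perf_hooks (available since Node 11.10). A minimal sketch - the threshold and reporting interval are illustrative, not from the comment:

    const { monitorEventLoopDelay } = require('perf_hooks');

    const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20ms
    histogram.enable();

    setInterval(() => {
      const p99Ms = histogram.percentile(99) / 1e6; // histogram values are in nanoseconds
      if (p99Ms > 100) {
        // A real service would emit a metric or page someone instead of logging.
        console.warn(`event loop blocked: p99 delay ${p99Ms.toFixed(1)}ms`);
      }
      histogram.reset();
    }, 5000);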
jdc0589, over 5 years ago
On a positive note: this was a good write-up.

On a negative note: FOR THE LOVE OF ALL THAT IS HOLY, HOW DID THIS HAPPEN.
vmarchaud, over 5 years ago
I've encountered different issues with NodeJS services in the past (and still do), both with CPU bottlenecks and heap allocations. So I wrote openprofiling-node [0] this summer to help me profile my apps directly in production and export the result to an S3 bucket. I believe it may help someone else here, so I'm posting it.

[0]: https://github.com/vmarchaud/openprofiling-node
pdimitar, over 5 years ago
...Or you could just use Erlang or Elixir, where concurrency *and* parallelism come pretty much out of the box, with very little effort required for you to fine-tune the desired policy / strategy.

The insistence on using Javascript is just beyond lunacy at this point.
nosianu, over 5 years ago
They write (somewhere in the middle):

> *Since V8 implements a stop-the-world GC, new tasks will inevitably receive less CPU time, reducing the worker’s throughput*

But there is this Google blog post from January 2019:

https://v8.dev/blog/trash-talk

> *Over the past years the V8 garbage collector (GC) has changed a lot. The Orinoco project has taken a sequential, stop-the-world garbage collector and transformed it into a mostly parallel and concurrent collector with incremental fallback.*

So I guess they used an older node.js version. The current LTS version is 12.x and it is from around the middle of this year.

---

PS: If the blog author reads this, there is an accessibility problem with the Google-hosted inline images. If I try - without an ad blocker - in an anonymous window, I see none of the inline images. Logged into Google with my own account I can see *some* but not all of the images. Apparently which images I can see depends on being logged in to my Google account? I also tried IE Edge just to see if the browser makes a difference - no inline images were visible there either.
awinter-py, over 5 years ago
Compared to a compiled language, node / JIT languages make it difficult to know what will be fast in prod.

The V8 JIT means that things like the order of keys in an object, or the number of different call sites for a function, might affect whether your function gets optimized.

And there's no easy way to find out if a JS function is falling back to slow mode, or to tell the build system "this is a hot path, don't let me write code that deopts this call".
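A small illustration (not from the comment) of the shape sensitivity described above: V8 specializes a property access for the object shapes it has already seen, and objects whose keys were added in different orders get different hidden classes, so a call site that sees several of them goes polymorphic and may be deoptimized.

    function getX(point) {
      return point.x;
    }

    const a = { x: 1, y: 2 };       // shape {x, y}
    const b = { y: 2, x: 1 };       // same keys, different order -> different hidden class
    const c = { x: 1, y: 2, z: 3 }; // a third shape

    let sum = 0;
    for (let i = 0; i < 1e6; i++) {
      sum += getX(a) + getX(b) + getX(c); // this call site now sees three shapes
    }
    console.log(sum);

    // Running with `node --trace-opt --trace-deopt app.js` shows which functions get
    // optimized or deoptimized; the exact flags and output vary between V8 versions.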
bfrog, over 5 years ago
It's not clear from the article why they were only able to run one request per node process, but that alone makes it questionable why they used Node at all, since the entire point of the environment has been nixed. The article makes it quite hard to understand how they arrived at that point in the first place.
tyingq, over 5 years ago
*"Only 10% of Plaid's data pulls involve a user who is present"*

Since they provide an API, it seems like some of the calls where they think a user isn't present might actually have one present.
Scarbutt, over 5 years ago
I don't want to be that guy, but why did they start with nodejs for something like this instead of using the JVM or Go?
rynop, over 5 years ago
I’d be curious to hear your reevaluation of moving this to Lambda after some of the major announcements during re:Invent. My guess is that some of the reasons you went with ECS have been addressed by these announcements. Obviously some of the new features are still in preview, but I would be interested to hear your analysis nonetheless.
tyingq, over 5 years ago
Does node have something similar to how APCu is used with PHP?

That is, an mmap-based KV store, so that if you choose to run more than one node process on a single server, it has a fast KV cache?

I'm aware you can use redis or similar, but a simple mmap KV store is simpler and faster for the single-server use case.
mceachen, over 5 years ago
In case anyone else gets excited by JSONStream, know that the package hasn't been updated in over a year, and the GitHub repo was archived by the author with no link to a successor.
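For context, JSONStream parses large JSON documents incrementally instead of buffering them for a single JSON.parse call, which is why it comes up in event-loop discussions like this one. A minimal sketch of that style of usage (the file name and the 'transactions.*' path are hypothetical):

    const fs = require('fs');
    const JSONStream = require('JSONStream');

    // Emits one 'data' event per element of the top-level "transactions" array,
    // without ever holding the whole document in memory at once.
    fs.createReadStream('transactions.json')
      .pipe(JSONStream.parse('transactions.*'))
      .on('data', (tx) => console.log('transaction:', tx))
      .on('end', () => console.log('done'));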
FanaHOVA, over 5 years ago
$300k is $300k, but they just raised $250M last year; is this really a good use of their engineering team's time? That's a little above ~0.1% of capital.
deedubaya, over 5 years ago
A good example of avoiding premature optimization. I'd imagine delaying tackling this problem freed them up to tackle problems that impact users.
supermatt, over 5 years ago
Ironic. Linked images failing to display due to "Rate limit exceeded"...
mirekrusin, over 5 years ago
4k containers? That's microservices going macro, big time.
GordonS, over 5 years ago
I don't like to be overly negative, especially when a company/team is being transparent about what they're doing and giving insight into their engineering practices - but has anyone else's estimation of Plaid's engineering team just gone down the toilet?

This blog post gives me the impression that Plaid is filled with either junior or incompetent engineers - scaling to 4k containers serving 1 request each for an API workload is absolute insanity.

These engineers are building stuff for *banking*. *Banking*!! There is literally no way I'm going near Plaid with a very long bargepole after reading this.

If I were someone senior at Plaid, I'd be pulling this blog post before it harms the company's reputation any further.
CyanLite2, over 5 years ago
TLDR: How to spend millions of dollars of our investors' money because we hired junior devs who chose a framework that was trendy but couldn't scale.
Phil_Latio, over 5 years ago
> We were running 4,000 Node containers

LOL
PixyMisa, over 5 years ago
Nobody involved in this project should be allowed to ever be in the same room as a computer again.