This is all about achieving a dataflow architecture by exploiting the CPU cache hierarchy, right?

We're basically talking about going *from* a design where 10k little "tasklet" state machines are each scheduled onto "scheduler" threads, where during its turn of the loop each tasklet might run whatever possibly-"cold" logic it likes...

...and turning it *into* a collection of cores that each have a thread pinned for evaluating one particular state-machine state, where every "tasklet" that gets scheduled onto a given core will always be doing the same logic, so that logic can stay "hot" in that core's cache. In other words, each core can function closer to a SIMD model.

Effectively, this is the same thing done in game programming when you lift everything that's about to do a certain calculation (i.e. is in a certain FSM state) up into a VRAM matrix and run a GPGPU kernel over it†. That kernel is your separate "core", processing everything in the same state.

Either way, it adds up to *dataflow architecture*: eliminating the overhead of context-switching and scheduling on general-purpose CPUs by having a specific component (like the "I/O coprocessors" on mainframes, or the (de)muxers in backbone switches) for each step of a pipeline, where that component can "stay hot" by doing exactly and only what it does, synchronously.

The difference here is that, instead of throwing your own microcontrollers or ASICs at the problem, you're getting 80% of the same benefit from using a regular CPU core but making it avoid executing any non-local jumps: which is to say, not just eliminating OS-level scheduling, but eliminating any sort of top-level per-event-loop switch that might jump to an arbitrary point in your program.

This is way more of a win for CPU programming than you'd expect from just subtracting the nominal time an OS context switch takes. Rewriting your logic to run as a collection of these "CPU kernels" (effectively, restricting your code the same way GPGPU kernels are restricted, and then just throwing it onto a CPU core) keeps the kernel's cache lines from being evicted, and builds up (and never throws away) an excellent stream of branch-prediction metadata for the CPU to use.

The *interesting* thing, to me, is that a compiler, or an interpreter JIT, could (theoretically) do this "kernel hoisting" for you. As long as there's a facility in your language to make it clear to the compiler that a particular function *is* an FSM state transition-function, you can write regular event-loop/actor-modelled code, and the compiler can transform it into a collection of pinned-core kernels like this as an *optimization*.
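As a rough sketch of what one of those hand-written pinned-core kernels might look like (all the names here, `Tasklet`, `StateQueue`, `pin_to_core`, and the toy transition functions, are invented for illustration, and the affinity call is Linux/glibc-specific): each FSM state gets its own queue and its own worker thread pinned to a core, and a worker only ever runs its state's transition function before forwarding the tasklet to the next state's queue.

```cpp
// Sketch: one pinned worker per FSM state, so each core keeps exactly one
// transition function (plus its branch-predictor history) hot.
// Linux-specific affinity; all names and transitions are toy examples.
#include <pthread.h>
#include <sched.h>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>

enum class State { Parse, Compute, Done };
struct Tasklet { int id; int payload; State state; };

// Per-state mailbox. A real implementation would use a lock-free ring.
struct StateQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<Tasklet> q;
    void push(Tasklet t) {
        { std::lock_guard<std::mutex> l(m); q.push_back(t); }
        cv.notify_one();
    }
    Tasklet pop() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty(); });
        Tasklet t = q.front(); q.pop_front();
        return t;
    }
};

StateQueue parse_q, compute_q, done_q;

// Pin the calling thread to one core so its working set stays resident there.
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Each worker is one "CPU kernel": it only ever executes its own state's
// transition, then forwards the tasklet to the next state's queue.
void parse_worker() {
    pin_to_core(0);
    for (;;) {
        Tasklet t = parse_q.pop();
        if (t.id >= 0) { t.payload += 1; t.state = State::Compute; }  // toy Parse step
        compute_q.push(t);
        if (t.id < 0) return;  // shutdown sentinel, forwarded downstream
    }
}

void compute_worker() {
    pin_to_core(1);
    for (;;) {
        Tasklet t = compute_q.pop();
        if (t.id >= 0) { t.payload *= 2; t.state = State::Done; }     // toy Compute step
        done_q.push(t);
        if (t.id < 0) return;
    }
}

int main() {
    std::thread a(parse_worker), b(compute_worker);
    for (int i = 0; i < 4; ++i) parse_q.push({i, i, State::Parse});
    parse_q.push({-1, 0, State::Parse});  // sentinel: drain the pipeline and stop
    for (;;) {
        Tasklet t = done_q.pop();
        if (t.id < 0) break;
        std::printf("tasklet %d finished with payload %d\n", t.id, t.payload);
    }
    a.join(); b.join();
    return 0;
}
```

The queues are the "passing tasklet state around for free" part; the point of the shape is that each core's instruction working set never changes, not the particular queue implementation.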
The compiler can even take a hybrid approach, where some cores do classical scheduling for all the "miscellaneous, IO-heavy" tasklets, and the rest of the cores are special schedulers that will only be passed a tasklet when it's ready to run in that scheduler's preferred state. With an advanced scheduler system (e.g. the Erlang VM's), the profiling JIT could even notice when your runtime workload has shifted to having 10k instances of the same transition-function running all the time, generate and start up a CPU-kernel thread for it (temporarily displacing one of the misc-work classical scheduler threads), and deschedule it again if the workload shifts so that it's no longer a win.

Personally, I've been considering this approach with GPGPU kernels as the "special" schedulers rather than CPU cores, but they're effectively equivalent in architecture, and perhaps in performance as well. While the GPU is faster because it gets to run your specified kernel with true SIMD parallelism, your (non-NUMA) CPU cores get to pass your tasklets' state around "for free", which often balances out: ramming data into and out of VRAM is expensive, and the fact that you're buffering tasklets to run on the GPGPU as a group potentially introduces a high-latency sync point for your tasklets. Soft-realtime guarantees might be more important than throughput.

---

† A fun tangent for a third model: if your GPGPU kernel outputs a separate dataset for each new FSM state its data members were found to transition to, and you have *other* GPGPU kernels for each of the other state-transition functions of your FSM waiting to take those datasets and run with them, then effectively your whole FSM can live entirely on the GPU as a collection of kernel "cores" passing tasklets back and forth, the same way we've been talking about with CPU cores above (there's a rough sketch of this at the end of the comment).

While this architecture probably wins over both the entirely-CPU and CPU-passing-to-GPGPU models for pure-computational workloads (which is, after all, what GPGPUs are supposed to be for), I imagine it would fall over pretty fast if you wanted to do much IO.

Does anyone know whether GPUs marketed specifically as GPGPUs, like the Tesla cards, have a means for low-latency access to regular virtual memory from within a GPGPU kernel? If they did, staying entirely on the GPGPU would definitely be the dominant strategy. At that point, it might even make sense to have, for example, a GPGPU Erlang VM, or entirely-in-GPGPU console emulators (imagine MAME's approach to emulated chips, but with each chip as a GPGPU kernel).

If you can get that, then effectively what you've got is less a GPU and more a meta-FPGA co-processor with "elastic allocation" of gate-arrays to your VHDL files. System architecture would likely change *a lot* if we ever got *that* in regular desktop PCs.
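To make that third model concrete, here's a rough CUDA sketch of the shape, with invented names and toy transition rules (error handling omitted, nothing here reflects a real system): one kernel per FSM state, each consuming a buffer of tasklets in its state and scattering results into the buffer of whichever state they transition to, while the host does nothing but route buffers between launches.

```cpp
// Sketch of "the FSM lives entirely on the GPU": one kernel per transition
// function, compacting outputs into per-next-state buffers via atomicAdd.
#include <cstdio>
#include <cuda_runtime.h>

struct Tasklet { int id; int value; };

// Toy state A transition: halve the value; zero means done, otherwise go to B.
__global__ void state_a_kernel(const Tasklet* in, int n,
                               Tasklet* to_b, int* b_count,
                               Tasklet* done, int* done_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Tasklet t = in[i];
    t.value /= 2;
    if (t.value == 0) done[atomicAdd(done_count, 1)] = t;
    else              to_b[atomicAdd(b_count, 1)]   = t;
}

// Toy state B transition: decrement, then hand everything back to A.
__global__ void state_b_kernel(const Tasklet* in, int n,
                               Tasklet* to_a, int* a_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Tasklet t = in[i];
    t.value -= 1;
    to_a[atomicAdd(a_count, 1)] = t;
}

int main() {
    const int N = 1024;
    Tasklet *a_buf, *b_buf, *done_buf;
    int *a_n, *b_n, *done_n;
    // Managed memory keeps the sketch short; a real version would stage
    // explicit device buffers and avoid the per-launch host round trips.
    cudaMallocManaged(&a_buf, N * sizeof(Tasklet));
    cudaMallocManaged(&b_buf, N * sizeof(Tasklet));
    cudaMallocManaged(&done_buf, N * sizeof(Tasklet));
    cudaMallocManaged(&a_n, sizeof(int));
    cudaMallocManaged(&b_n, sizeof(int));
    cudaMallocManaged(&done_n, sizeof(int));

    for (int i = 0; i < N; ++i) a_buf[i] = {i, 100 + i};
    *a_n = N; *b_n = 0; *done_n = 0;

    // The host only routes buffers between kernels; all transition logic
    // and all tasklet state stay resident on the GPU.
    while (*a_n > 0 || *b_n > 0) {
        if (*a_n > 0) {
            int n = *a_n; *a_n = 0;
            state_a_kernel<<<(n + 255) / 256, 256>>>(a_buf, n, b_buf, b_n,
                                                     done_buf, done_n);
            cudaDeviceSynchronize();
        }
        if (*b_n > 0) {
            int n = *b_n; *b_n = 0;
            state_b_kernel<<<(n + 255) / 256, 256>>>(b_buf, n, a_buf, a_n);
            cudaDeviceSynchronize();
        }
    }
    std::printf("%d tasklets reached Done\n", *done_n);
    return 0;
}
```

The per-launch cudaDeviceSynchronize is exactly the high-latency sync point mentioned above; the win is that no tasklet state ever has to leave VRAM between transitions.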