
Multi-core scaling: it’s not multi-threaded

65 points by lukes386, about 12 years ago

11 comments

nkurz, about 12 years ago
He suggests an interesting approach.

1) Tell the kernel it only has a limited set of cores to work with.

*The way to fix Snort’s jitter issues is to change the Linux boot parameters. For example, set “maxcpus=2”. This will cause Linux to use only the first two CPUs of the system. Sure, it knows other CPU cores exist, it just will never by default schedule a thread to run on them.*

2) Manually schedule your high-priority process onto a reserved core.

*Then what you do in your code is call the “pthread_setaffinity_np()” function to put your thread on one of the inactive CPUs (there is a Snort configuration option to do this per process). As long as you manually put only one thread per CPU, it will NEVER be interrupted by the Linux kernel.*

3) Turn off interrupts to keep things as real-time as possible.

*You can still get hardware interrupts, though. Interrupt handlers are really short, so probably won’t exceed your jitter budget, but if they do, you can tweak that as well. Go into “/proc/irq/smp_affinity” and turn off the interrupts in your Snort processing threads.*

4) Profit?

*At this point, I’m a little hazy at what precisely happens. What I think will happen is that your thread won’t be interrupted, not even for a clock cycle.*

Can anyone remove the haziness? I'm more interested in this for benchmarking than performance, and wonder how it compares to other ways of increasing priority like "chrt". Is booting with a low "maxcpus" necessary, or can the same be done at runtime?
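For step 2, a minimal C sketch of the affinity call (the core number is an arbitrary choice of mine, and I'm assuming it was kept idle at boot via maxcpus= or isolcpus=):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to a single core. Core 3 is arbitrary here,
 * assumed to have been excluded from normal scheduling at boot. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int err = pin_to_core(3);
    if (err != 0) {
        /* pthread functions return the error number directly */
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
        return 1;
    }
    /* ... latency-sensitive loop runs here, alone on its core ... */
    return 0;
}
```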
6ren, about 12 years ago
Hypothesis: we will never solve multi-core for general-purpose computing (there's also Amdahl's law: http://en.wikipedia.org/wiki/Amdahl%27s_law). But we can do multi-core for the embarrassingly parallelizable, such as graphics (top GPUs have over 1000 cores). So instead of solving this problem, our focus will shift to those tasks for which multi-core *does* work, because it's only these that keep improving at Moore's-Law-like rates.

Arguably, this is already happening.
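For reference, Amdahl's law bounds the speedup on n cores at 1/((1-p) + p/n) for a parallel fraction p. A toy calculation (the 95% figure is an assumption for illustration):

```c
#include <stdio.h>

/* Amdahl's law: best-case speedup on n cores given parallel fraction p. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.95;                 /* assume 95% of the work parallelizes */
    int cores[] = { 2, 4, 1000 };
    for (int i = 0; i < 3; i++)
        printf("%4d cores -> %.1fx\n", cores[i], amdahl(p, cores[i]));
    /* prints roughly 1.9x, 3.5x, 19.6x: even 1000 GPU-style cores
     * cannot beat 1/(1-p) = 20x if 5% of the work stays serial. */
    return 0;
}
```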
speeder, about 12 years ago
I wonder if we will ever figure out a way to resume improving clock speeds instead of adding more parallelism.

Parallelism has two major issues.

First, not all applications need it. In many cases you just want to perform a series of operations on a single starting value and nothing else; if you are, for example, calculating a single factorial, it is useless to make it more parallel.

Second, it is absurdly hard to write code for heavily parallelized hardware. Most coders will produce broken code no matter how good our helper libraries become; it is a totally different way of thinking.

Yes, for some things, like servers where you can throw a user onto each core, it is nice. But for many other uses, even simple single-core parallelism, like SIMD, is not very useful.
jws, about 12 years ago
*…from 33-MHz to 3-GHz, a thousand-fold increase…*

There had to be a better way to write that. I suppose more work per clock cycle and an increased number of cores contribute the other 10x of raw performance. But then the author goes on to say they are stuck, which isn't true of performance, only of clock rate. In any event, putting an "up is down" in your sentence should generally be avoided.

Edit: *The >>>proscribed<<< method for resolving this is a “lock”, where…* Sigh.

The article covers a lot of ground lightly. It talks about the new Haswell transactional memory instructions, the way Linux shards network counters, and a way to make Linux not use a core so you can schedule a process on it that will never be preempted.
kyrra, about 12 years ago
Tangentially related: Snort was doing research on moving to a multi-threaded architecture, but decided against it due to cache synchronization problems [1]. Their thoughts about splitting up processing were quite different from what the OP blog post suggests, though.

It looks like Snort gave up on one way of doing multi-threading, but they could still go the way suggested in the OP's post.

[1] http://securitysauce.blogspot.com/2009/04/snort-30-beta-3-released.html
javert, about 12 years ago
So, this post has a number of errors, and is fundamentally wrong.

(a) pthread_mutex_t and friends use futexes, which only call into the kernel when there actually is contention.

(b) It would be better to use chrt (change to real-time priority) than the maxcpus trick, because the former accomplishes the same thing but allows the core to still be used if the high-priority thread suspends (e.g. to do disk or network I/O).

(c) Contrary to his claim about Snort, there is no reason to prefer a multiprocess design over a multithreaded design for a particular application. There are no savings in overhead or synchronization or anything like that from going with processes. In fact, using processes and then using memory mapping to share when you could use threads is just making things harder for yourself for no reason.

(d) *What I’m trying to show you here is that “multi-core” doesn’t automatically mean “multi-threaded”.* Well, in computer science terminology, a thread is a schedulable entity, and a process is a schedulable entity with memory protection. So he's wrong. Lots of developers talk about threads and processes as orthogonal things, though, so I can see why he made that claim.

(e) *The overall theme of my talk was to impress upon the audience that in order to create scalable application, you need to move your code out of the operating system kernel. You need to code everything yourself instead of letting the kernel do the heavy lifting for you.* That is horrible advice that is just going to lead to lots of bugs and wasted effort. It's premature optimization. Even most people using the Linux realtime preemption patch (PREEMPT_RT) do not have such strict requirements that they need to take this advice.

(f) *Your basic design should be one thread per core and lock-free synchronization that never causes a thread to stop and wait.* That might apply to certain very specific real-time scenarios (as in embedded systems or HFT), but in general, no: you're just wasting the core when that one thread doesn't need to use it. Prefer real-time priorities if you really need them.

(g) *Specifically, I’ve tried to drill into you the idea that what people call “multi-threaded” coding is not the same as “multi-core”. Multi-threaded techniques, like mutexes, don’t scale on multi-core.* Again, you can only use multiple cores in parallel if there are multiple threads. And multi-threaded techniques *do* scale. You definitely may want to use lock-free synchronization instead of mutexes in some specialized scenarios, though.

EDIT: OK, here is one other thing I forgot in the list above.

(h) *There are multiple types of locks, like spinlocks, mutexes, critical sections, semaphores, and so on. Even among these classes there are many variations.* Technically, mutexes and semaphores are both ways of protecting critical sections, and a spinlock is a way of implementing a lock (including, possibly, a mutex or semaphore lock). Again, this is to some degree the difference between developers with a shared lingo and computer scientists. But if you go by that kind of lingo, you're missing part of the picture.
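On (b): the runtime equivalent of chrt is the sched_setscheduler() call. A minimal sketch (the priority value of 50 is an arbitrary choice of mine):

```c
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* SCHED_FIFO: run until we block or yield; lower-priority threads
     * on this core only get the CPU back when we give it up. Needs
     * root or CAP_SYS_NICE. Roughly equivalent to `chrt -f 50 ./prog`. */
    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {   /* pid 0 = self */
        perror("sched_setscheduler");
        return 1;
    }
    /* ... latency-sensitive work here; unlike the maxcpus trick, the
     * core remains usable by others whenever this process blocks ... */
    return 0;
}
```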
nonamegiven, about 12 years ago
"Multi-tasking was the problem of making a single core run multiple tasks. The core would switch quickly from one task to the next to make it appear they were all running at the same time, even though during any particular microsecond only one task was running at a time. We now call this “multi-threading”, where “threads” are lighter weight tasks."<p>I must have missed something.<p>Multi-tasking is multiple processes, which mostly have nothing to do with each other, switched in and out of the processor(s) by the OS, which do not share in-process memory or context. The programmer does nothing to make this happen, and normally has little to no say in it.<p>Multi-threading is a single process, where the threads <i>carefully</i> share context and memory, and they're all working roughly on the same thing; the programmer makes this happen explicitly, and usually fucks it up.<p><a href="https://en.wikipedia.org/wiki/Multitasking" rel="nofollow">https://en.wikipedia.org/wiki/Multitasking</a><p><a href="https://en.wikipedia.org/wiki/Multitasking#Multithreading" rel="nofollow">https://en.wikipedia.org/wiki/Multitasking#Multithreading</a>
stefantalpalaru, about 12 years ago
"Multi-threaded software goes up to about four cores, but past that point, it fails to get any benefit from additional cores."<p>Is there any basis for this affirmation or just the fact that his system has only 4 cores?
javert, about 12 years ago
*There are two fundamental ways of doing this: pipelining and worker-threads. In the pipeline model, each thread does a different task, then hands off the task to the next thread in the pipeline.*

Why not just implement the pipeline entirely in one thread, and then replicate it (just like worker threads)?

What will happen is that the first worker thread will be executing stage 2 while the second worker thread is executing stage 1. The OS will automatically schedule them on different cores.

Am I missing something?
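A minimal sketch of what this comment proposes (the stage functions and the item split are placeholders of mine):

```c
#include <pthread.h>
#include <stdio.h>

/* Each worker runs the whole pipeline on its own items; no hand-off
 * between stage threads is needed. */
static int stage1(int x) { return x * 2; }
static int stage2(int x) { return x + 1; }

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (int item = id; item < 8; item += 2)   /* two workers split the items */
        printf("worker %d: item %d -> %d\n", id, item, stage2(stage1(item)));
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = { 0, 1 };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    /* The kernel is free to schedule the two workers on different
     * cores, which is the point the comment makes. */
    return 0;
}
```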
miga, about 12 years ago
I recall similar results in nearly all applications since late in my MSc studies: mutexes are bad; pipes and sockets give better scaling. Thread-sync primitives sometimes scale up to 8-12 cores, but multiprocess applications usually end up much faster. In the age of GC'd VMs, one also needs to consider the synchronization cost of GC.
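For illustration, the pipe-based pattern alluded to here might look like this minimal sketch (the message and process roles are mine):

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Two processes passing work over a pipe instead of sharing state
 * under a mutex -- the pattern the comment reports scaling better. */
int main(void)
{
    int fds[2];
    pipe(fds);
    if (fork() == 0) {                          /* producer */
        close(fds[0]);
        const char *msg = "work item";
        write(fds[1], msg, strlen(msg) + 1);
        _exit(0);
    }
    close(fds[1]);                              /* consumer */
    char buf[64];
    if (read(fds[0], buf, sizeof buf) > 0)
        printf("got: %s\n", buf);
    wait(NULL);
    return 0;
}
```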
abraininavat, about 12 years ago
Maybe I'm missing something, but I'm not getting the point. It seems to me there's no fundamental difference between multiprocess with shared-memory regions for anything that needs to be shared, and multithreaded with mostly thread-local storage plus some shared data. The kernel is going to multiplex your single-threaded processes among the available cores just the same as it will multiplex your multiple threads among the available cores.

*Multi-threaded techniques, like mutexes, don’t scale on multi-core. Conversely, as Snort demonstrates, you can split a problem across multiple processes instead of threads, and still have multi-core code.*

Synchronization is synchronization. There are inter-process synchronization primitives, including mutexes. And you can use lock-free synchronization in a single-process multi-threaded scenario.
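As an illustration of the "shared-memory regions" half of that equivalence, a minimal POSIX shared-memory sketch (the segment name "/demo_shm" is arbitrary; most error handling omitted):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Share one int between two processes via POSIX shared memory.
 * Link with -lrt on older glibc versions. */
int main(void)
{
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, sizeof(int));
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (fork() == 0) {
        *shared = 42;        /* visible to the parent, unlike ordinary globals */
        _exit(0);
    }
    wait(NULL);
    printf("shared = %d\n", *shared);   /* prints 42 */
    shm_unlink("/demo_shm");
    return 0;
}
```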