Fwiw, the original implementation of threads in Java was 'green' threads (it was project Green, after all). These were implemented on SunOS (vs System V) using setjmp/longjmp and some auto-yield logic in the JVM (basically, when you did certain blocking operations it would start the operation and yield your thread while it completed). When I started I had the most System V experience, so I ported it over to Solaris to get to what was affectionately called the 'hello hello hello world world world' stage. The choice to use the Solaris native threads over green threads was motivated by the work that had gone into the Solaris kernel to do CPU scheduling efficiently. That gave Java multiprocessing for 'free'. Except it exposed a bunch of bugs in the mutex / synchronized code when you actually got parallel execution :-).<p>The magic, as the author points out, is dynamic stack allocation. If you disassociate a thread from a required stack size, then you are back to using just the amount of memory that your object heap and reference stores are using. GC gets trickier too, of course.<p>This should nominally be a non-issue on 64-bit address machines, as you only need to <i>map</i> the pages that actually have data in them (minimum one page), but that still puts a huge load on the page tables. This is especially true with huge pages, which are used to avoid TLB thrashing and to save on the number of levels you have to go through to get back the physical address of the page you are trying to access.<p>It's the kind of optimization problem that systems folks really like to dig into.
It's all about fixed resource allocations. The default stack size for user-land threads tends to be 1MB or 2MB, and there are also (smaller) kernel stacks.<p>In implementations of languages like Scheme or Haskell, where no stack may even be allocated, or where stacks might be allocated in chunks, the cost of a co-routine can be very small -- as small as the cost of a closure. If you take that to the limit and allocate every call frame on the heap, and if you make heavy use of closures/continuations, then you end up with more GC pressure, because instead of freeing every frame on return you have to free many of them via the GC.<p>In terms of ease of programming to deal with async events, the spectrum runs from threads on the one hand to callback hell on the other. Callback hell can be ameliorated by allowing lambdas and closures, but you still end up with indentation and/or paren/brace hell. Co-routines are somewhere in the middle of the spectrum, but closer to threads than to callback hell. Await is something closer to callback hell, with nice syntax to make things easier on the programmer.<p>Ultimately it's all about how to represent state.<p>In threaded programming, program state is largely implicit in the call stack (including all the local variables). In callback hell, program state is largely explicit in the form of the context argument that gets passed to the callback functions (or which they close over), with a little bit of state implicit in the extant event registrations. (A sketch contrasting the two styles follows below.)<p>Threaded programming is easier because the programmer doesn't have to think about how to compactly represent program state, but the cost is higher resource consumption, especially memory. More memory consumption == more cache pressure == more cache misses == slower performance.<p>Callback hell is much harder on the programmer because it forces the programmer to be much more explicit about program state, but this also allows the programmer to better compress program state, thus using fewer resources, thus allowing the program to deal with many more clients at once and also be faster than threaded programming.<p>Everything in computer science in the async I/O space in the last 30 years has been about striking the right balance between minimizing program state on the one hand and minimizing programmer pain on the other. Continuations, delimited continuations, await, and so on -- all are about finding that sweet spot.
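To make the state-representation point concrete, here's a minimal Go sketch (illustrative, with made-up names): the same echo logic first with its state implicit in the goroutine's stack, then with the state packed into an explicit context value that a hypothetical event loop would hand back to a callback.

    package echo

    import "net"

    // Threaded style: conn, buf, and n live on the goroutine's stack
    // across the blocking Read; the runtime preserves that state for us.
    func handleThreaded(conn net.Conn) {
        buf := make([]byte, 4096)
        n, err := conn.Read(buf) // blocks; the stack is the state
        if err != nil {
            return
        }
        conn.Write(buf[:n])
    }

    // Callback style: the same state made explicit in a context value.
    // A hypothetical event loop calls onReadable(s) when s.conn has
    // data; between events, only s itself is retained.
    type echoState struct {
        conn net.Conn
        buf  []byte
    }

    func onReadable(s *echoState) {
        n, err := s.conn.Read(s.buf)
        if err != nil {
            return
        }
        s.conn.Write(s.buf[:n])
    }

Note that echoState is exactly the 'compressed program state' described above: it is all that has to stay resident per client between events.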
I'm surprised Loom hasn't been mentioned yet in this thread.<p>Proposal: <a href="http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.html" rel="nofollow">http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.htm...</a><p>It's in an early phase, but there is already a prototype.<p>The commits in the mailing list mention recent zero-copy continuation thawing, if I'm not mistaken.<p><a href="http://mail.openjdk.java.net/pipermail/loom-dev/2018-October/thread.html" rel="nofollow">http://mail.openjdk.java.net/pipermail/loom-dev/2018-October...</a><p>EDIT: I'm mistaken; it is about lazy stack walking (but I recall zero-copy continuation thawing might be attempted).
> In Apache, each request is handled by 1 OS thread, effectively capping Apache to thousands of concurrent connections<p>That's wrong. There are a variety of Apache MPMs: <a href="https://httpd.apache.org/docs/2.4/mpm.html" rel="nofollow">https://httpd.apache.org/docs/2.4/mpm.html</a><p>In its footnotes, the article also points out that Erlang uses a similar system, which is true, and worth a look as well.
In typical Java fashion, there is a project, Quasar, that adds true green thread support to the language. It's done by instrumenting bytecode rather than through JVM support, but the benchmarks are still excellent. I'm very curious whether the Java or Go implementation is faster. Java does have fast async support, which is all you need to add green threads at a library level.<p>Also in Java fashion, the support isn't perfect, and requires some fiddling with JVM parameters (and annotations, obviously). It's still really solid in my experience, however, and a shame that it isn't more widely known.<p>A partner project, Comsat, provides fairly comprehensive library support for databases and such. I've used it before with Vert.X and the performance was insane. Possibly faster than Go.<p>As another aside, there's an active research project to add green threads to Java's core as well: Project Loom.
Go may have more threads, but it is not better at concurrency. It lacks several key things that make Java much easier to scale up, even if Go is easier to get started with. On my wish list:<p>* Thread-local storage. A million goroutines means nothing if they are all fighting over shared storage. Consider trying to implement Java's LongAdder class, which does intelligent sharding based on contention (a sketch of the sharding idea follows below).<p>* ConcurrentHashMap. sync.Map is horrendously contentious for writes. It's not even possible to write one yourself, because Go hides the built-in hash function it generates for each type.<p>* Goroutine joining. This is between a rock and a hard place: if you don't wait till goroutines are done, you risk leaks, and if you do wait, you need to put WaitGroups all over the place.<p>All those millions of goroutines don't help if you don't have the tools to coordinate them.
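To illustrate the LongAdder point, here's a minimal sketch of a striped counter in Go (all names made up). Because Go exposes no goroutine identity or goroutine-local storage, the caller has to supply the shard hint itself; Java's LongAdder can derive it from the current thread and observed contention, which is exactly what you can't express here.

    package striped

    import "sync/atomic"

    const numShards = 64

    // stripedCounter spreads updates across padded cells so that
    // concurrent adders don't all contend on a single cache line.
    type stripedCounter struct {
        cells [numShards]struct {
            n int64
            _ [56]byte // pad each cell out to a full 64-byte cache line
        }
    }

    // Add picks a shard from a caller-supplied hint; with thread-local
    // storage this hint could be derived automatically.
    func (c *stripedCounter) Add(hint uint, delta int64) {
        atomic.AddInt64(&c.cells[hint%numShards].n, delta)
    }

    // Sum folds all shards; like LongAdder.sum(), it is not atomic
    // with respect to concurrent Adds.
    func (c *stripedCounter) Sum() int64 {
        var total int64
        for i := range c.cells {
            total += atomic.LoadInt64(&c.cells[i].n)
        }
        return total
    }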
The article is off by an order of magnitude in one of its more headline-grabbing comparisons. It states 2.5 million goroutines in a gigabyte of RAM, when it should be about 250k at 4KB per stack (1 GB / 4 KB ≈ 250,000). Still an impressive difference, but less so.
I think it is important not to throw around words like 'RAM' in such discussions, and to use unambiguous terminology such as virtual address space and resident set size (RSS).<p>On a 64-bit machine, with a 1MB stack size (this is a claim on the virtual address space, not on physical memory), you can have millions of threads too (that you may hit a lower limit due to other OS-level knobs is another matter).
<i>> Supporting truly massive concurrency requires another optimization: Only schedule a thread when you know it can do useful work!</i><p>The problem is that the scheduler doesn't know when a goroutine can do useful work. It assumes that if a new value arrives on a channel there is a good chance the goroutine will do something useful, but that's not the case in general. My favorite example is Game of Life (a cellular automaton) implemented as a network of goroutines communicating their state changes through channels: you have to collect 8 updates from your predecessors to update your state, meaning you'll be scheduled to run 8 times before you finally make any real progress (sketched below). Not good for scalability.
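Concretely, a sketch of one such cell as a goroutine (illustrative, not taken from a real implementation):

    package life

    // cell receives one state update from each of its 8 neighbors, then
    // computes its next state and broadcasts it. The scheduler may wake
    // and run it once per incoming message before any real progress.
    // In a full grid the channels must be buffered to avoid deadlock.
    func cell(alive bool, in [8]<-chan bool, out [8]chan<- bool) {
        for {
            live := 0
            for _, ch := range in {
                if <-ch {
                    live++
                }
            }
            alive = live == 3 || (alive && live == 2) // Conway's rules
            for _, ch := range out {
                ch <- alive
            }
        }
    }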
> CPUs can only execute about one logical thread per core truly concurrently. An inherent side effect: If you have more threads than cores, threads must be paused to allow other threads to do work, then later resumed when it’s their turn again.<p>When would millions of goroutines be useful if you can only run ~4 threads in parallel?
Joe Damato claims the reason threads were historically slower than userspace threading is an accident involving MAP_32BIT in glibc:<p>> So I think I've identified the culprit and I believe the culprit is XFree86.<p>> it took seven years after this change went in for someone to figure out all the different parts that were damaged and fix them later, right? So threads were broken for a really long time.<p><a href="https://www.deconstructconf.com/2017/joe-damato-all-programmers-must-learn-c-and-assembly" rel="nofollow">https://www.deconstructconf.com/2017/joe-damato-all-programm...</a>
What's the use case for many virtual threads? I thought the main case for using threads was performance, where you need them to fully utilize the machine, though ideally you wouldn't be using threads at all.<p>Another use case for threads that I see is when you have work queues with different priorities, so you put each queue into its own thread with a matching priority. For example, 1 thread for responding to UI events, and 1 lower-priority thread for all background work.<p>Anyway, none of these use cases requires many threads. What use case am I missing that makes goroutines useful?<p>I guess doing something in a thread is a fail-safe way to make sure that 'some time' is spent processing it, so for example for the UI tasks, it would be easy to just dump all handling of UI events into virtual threads.
<i>For use cases where a high degree of concurrency is required, it’s simply the only option.</i><p>You can get high concurrency in Java by avoiding the 1 thread per request/connection model and using NIO with easy-to-use libraries like Netty.
This article is wrong and reflects what appears to be a widely-held belief (practically canon here on HN) that a thread has a fixed-size stack. This is not the case on every operating system! Linux, which I've heard is popular, initially commits one page for a thread stack and continues to add pages until that thread's stack limit is hit. So the statement from which the article draws its conclusion is incorrect:<p>"each OS thread has its own fixed-size stack."<p>100% wrong.
Comparing Java threads and goroutines doesn't really work because Java's thread model maps threads 1:1 to OS threads, whereas goroutines are simply continuations that can consume any OS thread. Java threads couple both the concerns of scheduling and task definition. Java Runnables address only the concern of task definition and are closer in nature to what goroutines are, but they still don't possess the ability to suspend and yield their execution.<p>You can implement Go's thread model (more generally known as green threads, coroutines, fibers, lightweight threads) on the JVM. This is precisely what Project Loom intends to do and what Scala effect systems already do. You construct a program in a type called IO that only describes a computation that can be suspended and resumed, unlike Java Runnables. These IO values are cooperatively scheduled to run on a fixed thread pool. Because these IO values just represent continuations, you can have millions of them cooperatively executing at the same time. And there is no context-switch penalty for swapping the continuation you are executing for another.
You can easily have millions of coroutines on the JVM with Kotlin coroutines:
<a href="https://kotlinlang.org/docs/reference/coroutines/coroutines-guide.html" rel="nofollow">https://kotlinlang.org/docs/reference/coroutines/coroutines-...</a>
The article claims that aside from memory consumption the main advantage of the Go approach is that the scheduler is integrated with the channels concept and thus knows which goroutines are blocked and don't need to be scheduled until some data shows up in their channel.<p>But don't OS threads work like this as well, by being integrated with file descriptors among other things? If a thread is blocked on a read, the OS knows this and won't keep scheduling that thread - right?<p>If so, the article's argument about time wasted on scheduling doesn't make sense.
This post is wrong. OS threads do not have a fixed-size stack in the 1:1 model. After all, if they did, then Go couldn't do stack growth, because goroutines are implemented as a layer on top of native threads. There are language runtimes, such as SML/NJ, that <i>heap allocate</i> all stack frames, and the kernel doesn't complain.<p>Additionally, as others have pointed out, Java (and, for that matter, all native code on Linux) used to use the Go model and switched to 1:1.
Ok, there's a technical gaffe somewhere, in my understanding. I regularly have hundreds of threads running in JVMs with between 128MB and 384MB of heap. I'm guessing the stack isn't kept on the heap?
I'm not really a Java user, but I vaguely recall that Java started with green threads to deal with this issue, long before you could just run an absurd number of threads in the OS without issues.
Would it be possible/efficient to write an OS in Go, and use its scheduler to schedule threads living in the implemented OS? What if you ran Go in that OS?
Creating millions of coroutines or "goroutines" isn't really interesting by itself. I believe the real questions for me are:<p>1. Is the Go scheduler better than the Linux scheduler when you have a thousand concurrent goroutines or threads?<p>2. Is creating goroutines really that much faster than creating native threads? (A sketch of how one might measure this follows below.)<p>3. Are goroutines relevant for long-lived and computationally intensive tasks, or just for short-lived I/O tasks?<p>4. What is the performance of using channels between goroutines compared to native threads?<p>Tbh, I have read several respected articles that criticize goroutines in terms of performance, and I am not really sure that Go's main virtue, imho its simple concurrency, is performant at all.
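On question 2, here's a hedged sketch of how the goroutine half could be measured with Go's built-in benchmark harness (numbers will vary by OS and hardware, and the native-thread side would need its own harness in C or Java):

    package bench

    import (
        "sync"
        "testing"
    )

    // Run with: go test -bench=GoroutineCreate
    func BenchmarkGoroutineCreate(b *testing.B) {
        var wg sync.WaitGroup
        for i := 0; i < b.N; i++ {
            wg.Add(1)
            go wg.Done() // spawn a goroutine that finishes immediately
        }
        wg.Wait()
    }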
I imagine there are actual use cases for it, but what are programs doing spinning up thousands of threads? I've never seen a program like that. Whatever happened to worker thread pools? (A minimal sketch of one follows below.)
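For what it's worth, worker pools are just as idiomatic in Go; a minimal sketch (illustrative names): a fixed number of goroutines drain a shared job channel instead of one goroutine being spawned per task.

    package pool

    import "sync"

    // runPool starts `workers` goroutines that drain the jobs channel,
    // and returns once the channel is closed and all jobs have finished.
    func runPool(jobs <-chan func(), workers int) {
        var wg sync.WaitGroup
        wg.Add(workers)
        for i := 0; i < workers; i++ {
            go func() {
                defer wg.Done()
                for job := range jobs {
                    job()
                }
            }()
        }
        wg.Wait()
    }

The caller closes the jobs channel to shut the pool down.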
It makes no sense to compare goroutines with OS threads; they are totally different things. It's like comparing apples with oranges: of course both of them can be used to solve the same problem, but they are totally different.
Like others have mentioned, it is well known that a thread-per-connection solution does not scale well; a much better approach is having a fixed-size pool of threads and using an asynchronous, non-blocking, event-driven model.
Actually, in our company we did some benchmarks measuring Go (1.9, with well-known HTTP servers) against Java 8 with Netty, and the latter always won.