One thing that really goes against my intuition is that user-space threads (lightweight threads, goroutines) are faster than kernel threads. Without knowing too much assembly, I would assume any modern processor would make a context switch a one-instruction affair: interrupt -> small scheduler code picks the thread to run -> a LOAD THREAD instruction, and the processor swaps in all the registers and the instruction pointer.<p>You probably can't beat that in user space, especially if you want to preempt threads yourself. You'd have to check after every step, or profile your own process, or something like that. And indeed, Go's scheduler was cooperative for a long time (Go 1.14 added signal-based asynchronous preemption).<p>But then, why can't you get the performance of goroutines with OS threads? Is it just because of legacy issues? Or does it only work with cooperative threading, which requires language support?<p>One thing I'm missing from that article is how the cooperativeness is implemented. I think in Go (and in Java's Project Loom), you have "normal code", but deep down in the network and IO functions there are magic "yield" points. So all the layers above can pretend they are running on regular threads, and you avoid the "colored function" problem, but you get runtime behavior similar to coroutines. Which only works if really every blocking IO call is modified to include yielding behavior. If you call a blocking OS function directly, I assume something bad will happen (in Go's case, I believe the runtime parks the OS thread in the syscall and moves the remaining goroutines to a different thread, so you burn a kernel thread but don't stall the scheduler).
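To make the "hidden yield points" idea concrete, here is a minimal Go sketch (my own illustration, not from the article): with GOMAXPROCS(1) there is only one OS thread running Go code, so any interleaving between goroutines must come from the runtime parking a goroutine at a blocking operation — here, unbuffered channel sends — and running another, with no kernel context switch involved.

```go
package main

import (
	"fmt"
	"runtime"
)

// pingPong starts a producer goroutine that sends n values over an
// unbuffered channel and collects them on the calling goroutine.
// Every blocking send/receive is a scheduling point for the runtime.
func pingPong(n int) []int {
	// Pin Go code to a single OS thread, so the producer and consumer
	// can only make progress by yielding to each other in user space.
	runtime.GOMAXPROCS(1)

	ch := make(chan int) // unbuffered: each send blocks until received
	go func() {
		for i := 0; i < n; i++ {
			ch <- i // blocking send: the runtime parks this goroutine
		}
		close(ch)
	}()

	var got []int
	for v := range ch { // blocking receive: another yield point
		got = append(got, v)
	}
	return got
}

func main() {
	fmt.Println(pingPong(3))
}
```

The same mechanism sits inside the runtime's network and file IO wrappers, which is why ordinary-looking Go code gets coroutine-like behavior without any visible yield calls.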