I find it interesting that the language that (IMO) best deals with parallelism was invented well before we moved in the multi-core direction.

The programming language landscape evolved and developed in the face of multi-core, async being a classic example. But the language that's most often held up as the best solution to a given parallelism problem is Erlang. Erlang was built as a good programming model for concurrency on a single core, and when multi-core came along, SMP support was 'just' a VM enhancement: no programs needed changing at all to take full advantage of all the cores (barring some special cases).
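To make that model concrete: this isn't Erlang, but a rough C++ sketch of the shared-nothing, mailbox-per-process idea (the Mailbox and worker names are just illustrative). Because each "process" owns its state and only exchanges messages, the same code runs unchanged whether the threads share one core or are spread across many.

    // Rough sketch of the Erlang-style model: each "process" owns its state
    // and communicates only through its mailbox; no memory is shared directly.
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    struct Mailbox {
        std::queue<std::string> q;
        std::mutex m;
        std::condition_variable cv;
        void send(std::string msg) {
            { std::lock_guard<std::mutex> lk(m); q.push(std::move(msg)); }
            cv.notify_one();
        }
        std::string receive() {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty(); });
            std::string msg = std::move(q.front());
            q.pop();
            return msg;
        }
    };

    // Each worker loops on its own mailbox; all its other state is private.
    void worker(Mailbox& inbox, int id) {
        for (;;) {
            std::string msg = inbox.receive();
            if (msg == "stop") return;
            std::cout << "worker " << id << " got: " << msg << "\n";
        }
    }

    int main() {
        std::vector<Mailbox> boxes(4);
        std::vector<std::thread> procs;
        for (int i = 0; i < 4; ++i) procs.emplace_back(worker, std::ref(boxes[i]), i);
        for (int i = 0; i < 4; ++i) boxes[i].send("hello");
        for (int i = 0; i < 4; ++i) boxes[i].send("stop");
        for (auto& t : procs) t.join();
    }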
Past related threads:

The Free Lunch Is Over – A Fundamental Turn Toward Concurrency in Software - https://news.ycombinator.com/item?id=15415039 - Oct 2017 (1 comment)

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software (2005) - https://news.ycombinator.com/item?id=10096100 - Aug 2015 (2 comments)

The Moore's Law free lunch is over. Now welcome to the hardware jungle. - https://news.ycombinator.com/item?id=3502223 - Jan 2012 (75 comments)

The Free Lunch Is Over - https://news.ycombinator.com/item?id=1441820 - June 2010 (24 comments)

Others?
I agree with this article a lot.

The layers upon layers of software bloat have grown along with hardware speed. The general trade-off is: we make the software as slow as possible to get the shortest development time with the least educated programmers. "Hey web dev intern, click me a new UI in the next two hours! Hurry!"

This intern-clicks-a-new-UI approach works because we have a dozen layers of incomprehension-supporting technologies. You don't need to know how most parts of the machine work; several libraries, VMs, and frameworks will make you a bed of roses.

My point is that we are overdoing the convenience for developers. Today there is far too much complexity and bloat in our systems. And there are not enough programmers trained to handle memory management and similar low-level tasks. Or their bosses wouldn't allow it, because of the deadline, you know.

I think the general trend is bad because there is no free lunch. No silver bullet. Everything is a trade-off. For example, C is still important because C's trade-off between programming convenience and runtime efficiency is very good: you pay a lot and you get a lot.

This is also true for parallel programming. To write highly efficient parallel code you need skill and education. No silver-bullet tooling will make this less "hard".

And here I see the irony. Faster CPUs were used to let less-educated devs deliver more quickly. More parallel CPUs need higher-skilled devs working more slowly to use the chip's full potential.
The free lunch isn't quite over, although significant advances were made in parallel computing... it turns out that CPUs have been able to "auto-parallelize" your sequential code all along. Just not nearly as efficiently as explicitly parallel methodologies.

In 2005, your typical CPU was a 2.2GHz Athlon 64 3700+. In 2021, your typical CPU is a Ryzen 5700X at 3.8GHz.

Single-threaded performance is far more than 72% better, however, even though the clock is only about 72% higher. The Ryzen 5700X has far more L3 cache, far more instructions per clock, and far more execution resources than the ol' 2005-era Athlon.

In fact, server-class EPYC systems are commonly in the 2GHz range, because low frequency saves a lot of power and servers want lower power usage. Today's EPYCs are still far faster per core than the Athlons of old.

-------------

This is because your "single thread" is executed more and more in parallel today. Thanks to the magic of dependency-cutting compilers, the compiler + CPU auto-parallelize your code and run it on the 8+ execution pipelines found on modern CPU "cores" (a small sketch of this effect follows below).

Traditionally, CPUs in the 90s had a single pipeline. But the 90s and 00s brought forth out-of-order execution, as well as parallel execution pipelines (aka superscalar execution). That means 2 or more pipelines execute your "sequential" code, yes, in parallel. Modern cores have more than 8 pipelines and are capable of 4+ or 6+ operations per clock tick.

This is less efficient than explicit, programmer-given parallelism, but it is far easier to accomplish. The Apple M1 continues this tradition of wider execution. I'm not sure "sequential" is dead quite yet (even if there's a huge amount of machinery working to auto-translate sequential into parallel, our code is largely written in a "sequential" fashion).

-------------

But the big advancement after 2005 was the rise of GPGPU compute. It was always known that SIMD machines (aka GPUs) were the most parallel systems; from the late 1980s and early 90s onward, the SIMD supercomputers always had the most FLOPs.

OpenCL and CUDA really took parallelism / SIMD mainstream. And indeed, these SIMD systems (be it AMD MI100 or NVidia A100) are far more efficient and have far higher compute capability than anything else.

The only "competitor" at supercomputer scale is the Fugaku supercomputer, with SVE (512-bit SIMD) ARM cores using HBM RAM (same as the high-end GPUs). SIMD seems like the obvious parallel compute methodology if you really need tons and tons of compute power.
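The sketch promised above: a hedged C++ illustration of the superscalar point. The same sum is written as one long dependency chain vs. four independent accumulators; an out-of-order core can overlap the independent chains across its execution pipelines, so the second version is usually noticeably faster even though both are "sequential" code. Exact numbers depend on the compiler and flags (the effect is clearest without -ffast-math).

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    double sum_chained(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;          // every add depends on the previous one
        return s;
    }

    double sum_unrolled(const std::vector<double>& v) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0, n = v.size();
        for (; i + 4 <= n; i += 4) {        // four independent dependency chains
            s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
        }
        for (; i < n; ++i) s0 += v[i];
        return (s0 + s1) + (s2 + s3);
    }

    int main() {
        std::vector<double> v(1 << 24, 1.0);
        auto time = [&](double (*f)(const std::vector<double>&)) {
            auto t0 = std::chrono::steady_clock::now();
            volatile double r = f(v);        // keep the result alive
            (void)r;
            auto t1 = std::chrono::steady_clock::now();
            return std::chrono::duration<double, std::milli>(t1 - t0).count();
        };
        std::printf("chained:  %.2f ms\n", time(sum_chained));
        std::printf("unrolled: %.2f ms\n", time(sum_unrolled));
    }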
It seems to me that where we really ended up was distributed systems. We solve problems not just by making our code concurrent to use more cores, but also by making it use more computers.
There seems to be no way to efficiently replay concurrent programs in a deterministic fashion on multiple cores. Nondeterminism makes parallelism and concurrency inherently hard and unfriendly to newcomers. It has become even more difficult in recent years due to architecture decisions: weak memory ordering makes things worse.

Suppose you are going to write a nontrivial concurrent program like a toy Raft; I believe digging through RPC logs will be the most painful part.

In contrast, on a single core, gdb is good enough. And there are also advanced examples like VMware's fault-tolerant VMs and FoundationDB's deterministic simulation. If we could debug concurrent programs without dirty tricks, just like single-threaded ones, I guess utilizing concurrency would be as handy as calling a function.
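A tiny illustration of the nondeterminism point (mine, not from any particular system): even with correct locking, the observed interleaving below differs from run to run, so simply re-executing the program won't reproduce the schedule that triggered a bug.

    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> schedule;   // records the observed interleaving
        std::mutex m;
        auto work = [&](int id) {
            for (int i = 0; i < 5; ++i) {
                std::lock_guard<std::mutex> lk(m);
                schedule.push_back(id);
            }
        };
        std::thread a(work, 1), b(work, 2);
        a.join(); b.join();
        // Prints a different 1/2 ordering on different runs of the same binary.
        for (int id : schedule) std::printf("%d ", id);
        std::printf("\n");
    }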
People have been saying this for decades, and while it's true, concurrency is still widely regarded as 'too hard'.

I'm not sure if this is justified (i.e. concurrency really is inherently too hard to be viable) or due to a lack of tooling/conventions/education.
One thing I’d love to do is a Smalltalk implementation where every message is processed in a separate thread. Could be a nice educational tool, as well as a great excuse to push workstations with hundreds of cores.
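Not Smalltalk, but a rough C++ sketch of the idea, in case it helps picture it: every "message send" is dispatched to its own thread and answers with a future. The send() helper and Counter class are made up for illustration, and a real implementation would need per-object synchronization.

    #include <future>
    #include <iostream>
    #include <utility>

    // Dispatch a member-function "message" to a fresh thread; the caller
    // gets a future for the reply instead of blocking on the receiver.
    template <class Obj, class Method, class... Args>
    auto send(Obj& receiver, Method method, Args&&... args) {
        return std::async(std::launch::async, method, &receiver,
                          std::forward<Args>(args)...);
    }

    struct Counter {
        int value = 0;
        int add(int n) { return value += n; }   // real code would need locking
    };

    int main() {
        Counter c;
        auto reply = send(c, &Counter::add, 41);   // like "c add: 41", answered later
        std::cout << reply.get() + 1 << "\n";      // waits for the reply: prints 42
    }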
2005: when the most important platform was the desktop, when virtualization had yet to mature, and when the dominant systems programming language was C++.

Today CPU cores are sliced up by cloud vendors to be sold in smaller portions, and phones are hesitant to go many-core since it would eat your battery at light speed. Dark silicon is spent on domain-specific circuits like AI, media, or networking instead of generic cores.

Parallelism is still a very hard problem in theory, but its practical need isn't as prevalent as we thought a decade-plus ago, partly thanks to the cloud and mobile. For most of us, parallelism is a solved-by-someone-else problem; it is left to a small number of experts.

Concurrency is still there, but the situation is much better than before (async/await, immutable data types, actors...).
I recall transactional memory being pitched as a way to take a bite out of the lock overheads associated with multithreading (as opposed to lock-free algorithms). Has it become mainstream?
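Not mainstream as far as I know, but GCC has had experimental support for a while (compile with something like g++ -fgnu-tm -pthread). A rough sketch of what it looks like, with a made-up transfer() example rather than production code:

    #include <cstdio>
    #include <thread>

    static int account_a = 100;
    static int account_b = 0;

    void transfer(int amount) {
        // The block commits atomically; no explicit lock is taken, and
        // conflicting transactions are retried by the TM runtime.
        __transaction_atomic {
            account_a -= amount;
            account_b += amount;
        }
    }

    int main() {
        std::thread t1(transfer, 10), t2(transfer, 20);
        t1.join(); t2.join();
        std::printf("a=%d b=%d total=%d\n", account_a, account_b,
                    account_a + account_b);   // total stays 100
    }

Whether this beats a plain mutex depends heavily on contention and on whether the hardware (e.g. lock elision) backs it up, which is presumably why it never quite took off.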