> which attempt to hide latency

It does an exceptionally good job at these attempts. If you think manually managed caches are fun, read [1] for an illustration of how much effort it takes just to sum an array on an architecture where on-chip RAM is managed manually. Another interesting case was the Cell CPU in the PS3; I don't have hands-on experience, but I've read it was equally hard to develop for.

> A low-level language for such processors would have native vector types of arbitrary lengths.

A low-level language would have native vector types of exactly the same lengths as the underlying hardware. "Arbitrary" is overkill unless the CPU supports arbitrary-length vectors.

Although these are not part of the language standard, all modern C and C++ implementations support them. Specifically, when compiling to AMD64 instructions, the compilers implement the native vector types and vector intrinsics defined by Intel. Same with NEON: all modern compilers implement what ARM has specified (see the first sketch at the end of this comment).

> you must be able to compare two structs using a type-oblivious comparison (e.g., memcmp)

Using memcmp on structures is not necessarily a great idea: the values of padding bytes are unspecified, so two structs whose members are all equal can still compare unequal (second sketch at the end).

> with enough high-level parallelism, you can suspend the threads... The problem with such designs is that C programs tend to have few busy threads.

Not just C programs. User input is serial: it can only interact with one application at a time. Display output is serial: it delivers a sequence of frames at 60 Hz. Web browsers tend to have few busy threads because JavaScript is single-threaded, and streaming parsers/decompressors/decryptors are not parallelizable either.

> ARM's SVE (Scalable Vector Extensions)—and similar work from Berkeley—provides another glimpse at a better interface between program and hardware.

Just because it's different does not automatically make it better. The main problem with scalable vectors is that they seem designed for problems CPUs are no longer solving. For massively parallelizable, vertical-only FP32 and FP64 math, GPGPU is the way to go: an order of magnitude faster while also being much more power efficient. CPU SIMD is used for more than vertical-only math. One case is non-vertical operations, i.e. shuffles; a trivial example is transposing a 4x4 matrix held in 4 registers. Another is operations on very small vectors; CPUs even have a dedicated DPPS instruction for FP32 dot products. For both use cases, scalable vectors make little sense (third sketch at the end).

> a garbage collector becomes a very simple state machine that is trivial to implement in hardware

People tried that a few times, first with Lisp machines, then with Java chips. General-purpose CPUs were better.

> Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.

Nvidia did precisely that: they made a processor designed purely for compute speed. I wouldn't call them a commercial failure.

[1] https://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf
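To make the intrinsics point concrete, here is a minimal sketch of summing an array with Intel's SSE intrinsics. `sum4` is a made-up name, and the code assumes the length is a multiple of 4; it compiles as-is with GCC, Clang, and MSVC targeting AMD64:

    #include <stddef.h>
    #include <immintrin.h>

    /* Sum an array 4 floats at a time; assumes n is a multiple of 4. */
    float sum4(const float *p, size_t n) {
        __m128 acc = _mm_setzero_ps();
        for (size_t i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
        /* Horizontal reduction of the 4 lanes into lane 0. */
        acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
        acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
        return _mm_cvtss_f32(acc);
    }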
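The memcmp pitfall from above is easy to demonstrate. A sketch, assuming a typical ABI that inserts 3 padding bytes after `c`:

    #include <stdio.h>
    #include <string.h>

    struct S { char c; int i; };  /* padding after 'c' on typical ABIs */

    int main(void) {
        struct S a, b;
        memset(&a, 0x00, sizeof a);
        memset(&b, 0xFF, sizeof b);  /* different garbage in the padding */
        a.c = b.c = 'x'; a.i = b.i = 42;
        /* All members compare equal, yet memcmp may report a difference
           because the padding bytes differ. */
        printf("%d\n", memcmp(&a, &b, sizeof a));
        return 0;
    }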
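Finally, the two SIMD use cases where scalable vectors fall short, expressed with fixed-width SSE intrinsics. `transpose4x4` and `dot4` are made-up names; the dot product requires SSE4.1:

    #include <xmmintrin.h>  /* _MM_TRANSPOSE4_PS */
    #include <smmintrin.h>  /* _mm_dp_ps, needs SSE4.1 */

    /* Non-vertical math: transpose a 4x4 matrix held in 4 registers. */
    void transpose4x4(__m128 *r0, __m128 *r1, __m128 *r2, __m128 *r3) {
        _MM_TRANSPOSE4_PS(*r0, *r1, *r2, *r3);
    }

    /* Tiny vectors: FP32 dot product in a single DPPS instruction.
       Mask 0xF1: multiply all 4 lanes, sum, put the result in lane 0. */
    float dot4(__m128 a, __m128 b) {
        return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
    }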