I briefly tried Futhark for a couple of weeks, but it feels like wishful thinking.<p>Writing efficient GPU programs is all about modeling the hardware memory hierarchy in your algorithms and using it efficiently, e.g., by controlling memory transfers across the levels of that hierarchy.<p>Futhark doesn’t really let you do that. The consequence is that if you need sort, inclusive_scan, or something similar, either the Futhark compiler exposes a built-in primitive for it, or you are out of luck. Sure, those particular operations can be done by calling CUB, CUTLASS, cuFFT, etc., but if you need to solve a problem in your own domain, you probably need to use something else.
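<p>To make the memory-hierarchy point concrete, here is a rough CPU-only sketch of the tiling pattern I mean, applied to an inclusive scan. Everything in it is an assumption for illustration: the name tiled_inclusive_scan is made up, the small local buffer only stands in for GPU shared memory, and the loop over tiles is sequential, whereas a real GPU kernel runs the tiles in parallel and has to propagate the carry between them some other way.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for a tiled inclusive scan (prefix sum).
// `local` plays the role of fast on-chip memory (shared memory on a GPU);
// `in` and `out` play the role of slow global memory.
// Hypothetical helper, not part of Futhark, CUB, or any other library.
std::vector<int> tiled_inclusive_scan(const std::vector<int>& in, std::size_t tile) {
    assert(tile > 0);
    std::vector<int> out(in.size());
    std::vector<int> local(tile);  // "fast memory", reused for every tile
    int carry = 0;                 // running total of all previous tiles

    for (std::size_t base = 0; base < in.size(); base += tile) {
        const std::size_t n = std::min(tile, in.size() - base);

        // 1. One bulk load from "slow" memory into the local tile.
        std::copy(in.begin() + base, in.begin() + base + n, local.begin());

        // 2. Scan entirely inside the tile, touching only "fast" memory.
        for (std::size_t i = 1; i < n; ++i) local[i] += local[i - 1];

        // 3. One bulk store back, adding the carry from earlier tiles.
        for (std::size_t i = 0; i < n; ++i) out[base + i] = local[i] + carry;

        carry += local[n - 1];
    }
    return out;
}
```

<p>Calling tiled_inclusive_scan({1, 2, 3, 4, 5}, 2) should give {1, 3, 6, 10, 15}. The point is that the tile size and the explicit load, compute, and store steps are decisions the programmer makes, and that is the kind of control I could not find a way to express.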