
Bend: a high-level language that runs on GPUs (via HVM2)

1041 points by LightMachine 12 months ago

83 comments

Twirrim 12 months ago
For what it's worth, I ported the sum example to pure python.

    def sum(depth, x):
        if depth == 0:
            return x
        else:
            fst = sum(depth-1, x*2+0)  # adds the fst half
            snd = sum(depth-1, x*2+1)  # adds the snd half
            return fst + snd

    print(sum(30, 0))

under pypy3 it executes in 0m4.478s, single threaded. Under python 3.12, it executed in 1m42.148s, again single threaded. I mention that because you include benchmark information:

    CPU, Apple M3 Max, 1 thread: 3.5 minutes
    CPU, Apple M3 Max, 16 threads: 10.26 seconds
    GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds

The Bend single-threaded version has been running for 42 minutes on my laptop, is consuming 6GB of memory, and still hasn't finished (12th Gen Intel(R) Core(TM) i7-1270P, Ubuntu 24.04). That seems to be an incredibly slow interpreter. Has this been tested or developed on anything other than Macs / aarch64?

I appreciate this is early days, but it's hard to get excited about what seems to be incredibly slow performance from a really simple example you give. If the simple stuff is slow, what does that mean for the complicated stuff?

If I get a chance tonight, I'll re-run it with the `-s` argument, see if I get anything helpful.
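For reference (an aside with my own arithmetic, not part of the comment above): the recursion has a simple closed form, which pins down what the benchmark should print, and also shows why Bend's u24 integers would wrap the result modulo 2^24.

    # Closed form for the recursion above: sum(depth, x) totals the
    # 2**depth consecutive integers x*2**depth, ..., x*2**depth + 2**depth - 1.
    def sum_closed(depth, x):
        n = 1 << depth
        base = x * n
        return base * n + n * (n - 1) // 2

    print(sum_closed(30, 0))              # 576460751766552576
    print(sum_closed(30, 0) % (1 << 24))  # 0 -- what a u24 result wraps to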
vegadw 12 months ago
A lot of negativity in these threads. I say ~cudas~ kudos to the author for getting this far! The only similar project I'm aware of is Futhark, and that's Haskell-y syntax - great for some people, but to the general class of C/C++/Python/JS/Java/etc. devs pretty arcane and hard to work with. My biggest complaint with this is that, unlike Futhark, it only targets CUDA or multi-core. Futhark can target OpenCL, CUDA, ISPC, HIP, single-core CPU, or multi-core CPU. The performance problems others are pointing out I'm certain can be tackled.
CorrectingYou 12 months ago
OP comes around with some of the coolest things posted on HN recently, and all he gets is extensive criticism, when it is clear that this is an early version :/
andrewp123 12 months ago
I just wanted to comment on how good the homepage is - it's immediately clear what you do. Most people working with "combinators" would feel a need to use lots of scary lingo, but OP actually shows the simple idea behind the tool (this is the opposite take of most academics, who instead show every last detail and never tell you what's going on). I really appreciate it - we need more of this.
praetor22 12 months ago
Look, I understand the value proposition and how cool it is from a theoretical standpoint, but I honestly don't think this will ever become relevant.

Here are some notes from my first impressions and after skimming through the paper. And yes, I am aware that this is very, very early software.

1. Bend looks like an extremely limited DSL. No FFI. No way of interacting with raw buffers. Weird 24-bit floating point format.

2. There's a reason why ICs are not relevant: performance is and will always be terrible. There is no other way to put it; graph traversal simply doesn't map well onto hardware.

3. The premise of optimal reduction is valid. However, you still need to write the kernels in a way that can be parallelized (i.e. no data dependencies, use of recursion).

4. There are no serious examples that directly compare Bend/HVM code with its equivalent OMP/CUDA program. How am I supposed to evaluate the reduction in implementation complexity and what to expect on performance? So many claims, so few actual comparisons.

5. In the real world of high-performance parallel computing, tree-like structures are non-existent. Arrays are king. And that's because of the physical nature of how memory works on a hardware level. And do you know what works best on mutable contiguous memory buffers? Loops. We'll see when HVM will implement this.

In the end, what we currently have is a half-baked language that is (almost) fully isolated from external data, extremely slow, and a massive abstraction over the underlying hardware (unutilised features: multilevel caches, tensor cores, SIMD, atomics).

I apologize if this comes out as harsh; I still find the technical implementation and the theoretical background to be very interesting. I'm simply not (yet) convinced of its usefulness in the real world.
delu 12 months ago
Ten years ago, I took a course on parallel algorithms (15-210 at CMU). It pitched parallelism as the future of computing as Moore's law would hit inevitable limits. I was sold and I was excited to experiment with it. Unfortunately, there weren't many options for general parallel programming. Even the language we used for class (SML) wasn't parallel (there was a section at the end about using extensions and CUDA, but it was limited from what I recall).

Since then, I was able to run some experiments with multithreading (thanks, Rust) and get very creative with shaders (thanks, Shadertoy). But a general parallel language on the GPU? I'm super excited to play with this!
ziedaniel1 12 months ago
Very cool idea - but unless I'm missing something, this seems very slow.

I just wrote a simple loop in C++ to sum up 0 to 2^30. With a single thread without any optimizations it runs in 1.7s on my laptop -- matching Bend's performance on an RTX 4090! With -O3 it vectorizes the loop to run in less than 80ms.

    #include <iostream>

    int main() {
        int sum = 0;
        for (int i = 0; i < 1024*1024*1024; i++) {
            sum += i;
        }
        std::cout << sum << "\n";
        return 0;
    }
Arch485 12 months ago
I want to congratulate the author on this; it's super cool. Making correct automatic parallelization is nothing to sneeze at, and something you should absolutely be proud of.

I'm excited to see how this project progresses.
jjovan1 12 months ago
Why so much negativity? The angry crowd sounded more like bots trying to test OP's intelligence by exploiting the README's imperfections while trying to change the context and intent of the post. It's so ignorant and brutal. They spent hours arguing without taking 2 minutes to properly read the README. OP is a one-man show and now they all want to piss on OP. Keep going, OP!
animaomnium 12 months ago
Fala Taelin, nice work! Does HVM2 compile interaction nets to e.g. spirv, or is this an interpreter (like the original HVM) that happens to run on the GPU?

I ask because a while back I was messing around with compiling interaction nets to C after reducing as much of the program as possible (without reducing the inputs), as a form of whole program optimization. Wouldn't be too much harder to target a shader language.

Edit: Oh I see...

> This repository provides a low-level IR language for specifying the HVM2 nets, and a compiler from that language to C and CUDA HVM

Will have to look at the code then!

https://github.com/HigherOrderCO/HVM

Edit: Wait, nvm, it looks like the HVM2 CUDA runtime is an interpreter that traverses an in-memory graph and applies reductions.

https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f267841a24119ffd569c714d/src/hvm.cu#L346

I was talking about traversing an interaction net to recover a lambda-calculus-like term, which can be lowered to C a la Lisp in small pieces with minimal runtime overhead.

Honestly the motivation is, you are unlikely to outperform a hand-written GPU kernel for, like, ML workloads using Bend. In theory, HVM could act as glue, stitching together and parallelizing the dispatch order of compute kernels, but you need a good FFI to do that. Interaction nets are hard to translate across FFI boundaries. But if you compile nets to C, keeping track of FFI compute kernel nodes embedded in the interaction network, you can recover a sensible FFI with no translation overhead.

The other option is implementing HVM in hardware, which I've been messing around with on a spare FPGA.
MrLeap 12 months ago
This is incredible. This is the kind of work we need to crack open the underutilized GPUs out there. I know LLMs are all the rage, but there's more gold in them hills.
yetihehe 12 months ago
Bend looks like a nice language.

> That's a 111x speedup by doing nothing. No thread spawning, no explicit management of locks, mutexes. We just asked bend to run our program on RTX, and it did. Simple as that. Note that, for now, Bend only supports 24-bit machine ints (u24), thus, results are always mod 2^24.

Ahh, not even 32-bit? Hmm, that seems pretty arbitrary for someone not accustomed to GPUs and wanting to solve some problems requiring 64 bits (gravitational simulation of the solar system at millimeter resolution could use ~58-bit ints for position).
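As a rough cross-check of that bit-width estimate (a sketch using my own assumed numbers, not the commenter's):

    # Bits needed for signed mm-resolution positions out to ~120 AU
    # (roughly the heliopause; the 120 AU figure is an assumption here).
    import math

    AU_MM = 1.496e11 * 1000          # one astronomical unit in millimetres
    radius_mm = 120 * AU_MM          # ~1.8e16 mm
    bits = math.log2(2 * radius_mm)  # signed range covering +/- radius
    print(round(bits, 1))            # ~55.0 -- same ballpark as "~58 bits"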
ruste 12 months ago
Been watching your development for a while on Twitter. This is a monumental achievement and I hope it gets the recognition it deserves.
zackmorris 12 months ago
This is nice, and obvious. I've waited about 20 years since I learned MATLAB and GNU Octave for someone to make a graph solver like this. And about 25 years since I first had the idea, when I was learning VLSI with VHDL in college and didn't see anything like the functional programming of circuits in what at the time was the imperative C++ world. The closest thing then was Lisp, but nobody talked about how the graph representation (intermediate code or i-code in the imperative world) could be solved in an auto-parallelized way.

We still see this today in how languages go out of their way to implement higher order method libraries (map/reduce/filter) but then under the hood there is no multithreading, they just expect the developer to annotate their loops to be parallel because the languages aren't formal enough to know about side effects in the innermost logic, and don't support immutability or performant pass-by-value semantics with copy-on-write anyway. So we end up with handwavy languages like Rust that put all of that mental load onto the developer for basically no gain, they just save memory by performing computation in-place imperatively.

I also like how Bend sidesteps the nonexistence of highly scaled symmetric multiprocessing CPUs by supporting GPUs. It makes the argument moot that GPUs can't be stopped because they're too big to fail. Julia is the only other language I've seen that tries this. I wish Clojure did, although it's been a long time since I followed it so maybe it has some parallelism?

I would have dearly loved to work on something like Bend, had someone solved the funding issue. Nobody wants to pay for pure research, and nobody sees the need for languages that do what everyone else is doing except easier. We have Kickstarter for widgets and Patreon for influencers, but makers have to bootstrap themselves or learn everything about finance or live in the right city or have a large network to hopefully meet an angel investor or work in academia and lose all rights to what they invent while spending the majority of their time hustling for grants anyway. So it just never happens and we're stuck with the same old busted techniques. Like how Hollywood only has money for sequels and reboots or the recording industry only has money for canned corporate music and hits from already famous artists and yet another cover that yanks the original better song off the radio.

A quarter of a century can go by in the blink of an eye if you get suckered into building other people's dreams as a people-pleaser. Be careful what you work on.
npalli 12 months ago
Is the recursive sum the best function to show multi-threading or GPU speedups? Seems unlikely. FWIW, I ported the Python example to Julia and it ran in about 2.5 seconds, the same as the C++ version. Pure Python 3.12 took 183 seconds.

    function sum(depth, x)
        if depth == 0
            return x
        else
            fst = sum(depth-1, x*2+0)
            snd = sum(depth-1, x*2+1)
        end
        return fst + snd
    end

    println(sum(30,0))
notfed 12 months ago
This is really, really cool. This makes me think, "I could probably write a high performance GPU program fairly easily"...a sentence that's never formed in my head.
robust-cactus 12 months ago
This is awesome and much needed. Keep going, forget the overly pedantic folks, the vision is great and early results are exciting.
anentropic 12 months ago
I remember seeing HVM on here a year or two back when it came out and it looked intriguing. Exciting to see something being built on top of it!

I would say that the play on words that gives the language its name ("Bend") doesn't really make sense...

https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md

> Bending is the opposite of folding. Whatever fold consumes, bend creates.

But in everyday language bending is not the opposite of folding, they are more or less the same thing. Why not "unfold", which also has a connotation of "the process of happening" as well as merely the opposite of folding?

I have a question about the example code and output for bending:

    type Tree:
      Node { ~lft, ~rgt }
      Leaf { val }

    def main():
      bend x = 0:
        when x < 3:
          tree = Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
        else:
          tree = Tree/Leaf { val: 7 }
      return tree

    tree = fork(0)
    tree = ![fork(1), fork(1)]
    tree = ![![fork(2),fork(2)], ![fork(2),fork(2)]]
    tree = ![![![fork(3),fork(3)], ![fork(3),fork(3)]], ![![fork(3),fork(3)], ![fork(3),fork(3)]]]
    tree = ![![![7,7], ![7,7]], ![![7,7], ![7,7]]]

Where does the initial "tree = fork(0)" come from?
mjaniczek 12 months ago
I've made a benchmark of Bend running a simple counter program on CPU vs GPU, vs Haskell, Node, Python, C, that I plan to write a blogpost about, probably this Sunday:

https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8EesInLeGbxM-oSmPQ8EJOnWLDA/edit?usp=drivesdk

It's magical how the GPU version is basically flat (although with a high runtime init cost).
gsuuon 12 months ago
Congrats on the HVM2 launch! Been following for a while, excited to see where this project goes. For others who are lost on the interaction net stuff, there was a neat Show HN that gave a more hands-on interactive intro: https://news.ycombinator.com/item?id=37406742 (the 'Get Started' writeup was really helpful)
klabb3 12 months ago
This is very exciting. I don’t have any GPU background, but I have been worrying a lot about CUDA cementing itself in the ecosystem. Here devs don’t need CUDA directly, which would help decouple the ecosystem from cynical mega corps, always good! Anyway, enough politics..

Tried to see what the language is like beyond hello world and found the guide[1]. It looks like a Python and quacks like a Haskell? For instance, variables are immutable, and tree-like divide and conquer data structures/algorithms are promoted for getting good results. That makes sense I guess! I’m not surprised to see a functional core, but I’m surprised to see the pythonic frontend, not that it matters much. I must say I highly doubt that it will make it much easier for Python devs to learn Bend though, although I don’t know if that’s the goal.

What are some challenges in programming with these kinds of restrictions in practice? Also, are there good FFI options?

[1]: https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md
davidw 12 months ago
As a resident of Bend, Oregon... it was kind of funny to read this, and I'm curious about the origin of the name.
api 12 months ago
Oh wow do I wish this existed when I was playing with evolutionary computation and genetic algorithms in college…
smusamashah 12 months ago
I have no interest in this tech as it's apparently for backend stuff and not actually rendering things by itself.

But the demo GIF is probably the best I have seen in a GitHub readme. I watched it till the end. It was instantly engaging. I wanted to see the whole story unfold.
darlansbjr 12 months ago
Would a compiler be faster by using HVM? Would love to see a fully parallel version of typescript tsc
anonzzzies 12 months ago
Maybe I missed it, but there seems to be no license attached to HVM2, nor to Bend or Kind?
egnehots 12 months ago
The interesting comparison nowadays would be against Mojo:

https://www.modular.com/max/mojo
magnio 12 months ago
Congrats on the launch.

I know the docs say this will be fixed soon, but what is the main reason for restricting number types to 24 bits? I saw in the code that they are wrappers around the 32-bit system number types, so what prevents Bend from changing them to U32(u32) right now?
throwaway2562 12 months ago
Interaction combinators (Lafont): https://www.semanticscholar.org/paper/Interaction-Combinators-Lafont/6cfe09aa6e5da6ce98077b7a048cb1badd78cc76
jes5199 12 months ago
I think I have a use for this, but I’m realizing that I don’t know how to build a mental model of what is going to parallelize in this system. Surely some algorithms are better at getting chopped up than others - how can I tell what is going on?
mattnewport 12 months ago
This looks cool. I find myself wishing for a language and introductory tutorial that isn't so targeted at Python programmers, however (though I understand from a commercial point of view why that may make sense).

It seems like this is actually an elegant typed functional language, but the Python syntax looks ugly and verbose and like it's trying to hide that, compared to something more ML/F# or Haskell inspired.

I'll try and get past that though, as it does look like there's something pretty interesting here.
magicalhippo 12 months ago
> Everything that *can* run in parallel, *will* run in parallel.

On the CPU, there's typically a threshold where dividing and coordinating the parallel work takes more time than simply doing the work on a single thread. Thus you can make the overall runtime much faster by not dividing the work all the way, but rather stopping at that optimal threshold and then just looping over the remaining work in the worker threads.

How does this work on the GPU using Bend? Been too long since I did any GPU programming.
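For anyone unfamiliar with that CPU-side pattern, here is a minimal sketch (my own illustration, not from the thread or the Bend docs; the cutoff value is a made-up tuning knob, and CPython's GIL means this shows the shape of the technique rather than a real speedup):

    # Divide-and-conquer sum with a sequential cutoff: fork only the top
    # `cutoff` levels into tasks, then loop sequentially inside each task.
    from concurrent.futures import ThreadPoolExecutor

    def seq_sum(depth, x):
        # Sequentially sum the 2**depth leaves under node x.
        n = 1 << depth
        return sum(x * n + k for k in range(n))

    def par_sum(depth, x, cutoff=4):
        n = 1 << cutoff  # number of independent sub-trees / tasks
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(seq_sum, depth - cutoff, x * n + k)
                       for k in range(n)]
            return sum(f.result() for f in futures)

    print(par_sum(24, 0))  # smaller depth than the benchmark, to keep it quick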
KingOfCoders 12 months ago
The website claims "automatically achieves near-ideal speedup".

12x for 16x threads

51x for 16,000x threads

Can someone point me to a website where it explains that this is the "ideal speedup"? Is there a formula?
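One common yardstick (my own aside, not the project's stated model) is Amdahl's law, S(p) = 1 / ((1 - f) + f/p), where f is the parallelizable fraction of the work; ideal or linear speedup is the f = 1 case, S(p) = p. Fitting f to the quoted 12x at 16 threads:

    # Back-of-the-envelope Amdahl's law fit; f is inferred here, not measured.
    def amdahl(f, p):
        return 1.0 / ((1.0 - f) + f / p)

    # 12 = 1 / ((1 - f) + f/16)  =>  f = (1 - 1/12) / (1 - 1/16)
    f = (1 - 1 / 12) / (1 - 1 / 16)
    print(round(f, 3))                  # ~0.978 parallelizable
    print(round(amdahl(f, 16), 1))      # 12.0 by construction
    print(round(amdahl(f, 16_000), 1))  # ~44.9 ceiling at 16,000 threads

Under that crude model the two reported figures are roughly consistent with f around 0.98, so "near-ideal" reads as a claim about scaling shape rather than S(p) = p.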
mjaniczek 12 months ago
I've made a benchmark of the current version of Bend running a simple counter program on CPU vs GPU, vs Haskell, Node, Python, C: https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8EesInLeGbxM-oSmPQ8EJOnWLDA/edit?usp=drivesdk

It's magical how the GPU version is basically flat (although with a high runtime init cost)
highfrequency 12 months ago
> CPU, Apple M3 Max, 1 thread: 3.5 minutes

> CPU, Apple M3 Max, 16 threads: 10.26 seconds

Surprised to see a *more than linear* speedup in CPU threads. What's going on here?
hintymad 12 months ago
Speaking of parallel computing, any book or series of books that can help an engineer learn parallel programming and go from zero to hero? Ideally the books will cover intuition, in-depth details, and theory. Something like The Art of Multiprocessor Programming by Herlihy et al. for concurrent programming, even though that book arguably still has too steep of a learning curve.
mbforbes 12 months ago
Congratulations on the launch and hard work so far! We need projects like this. Great readme and demo as well.

Every time I try to write shaders, or even peek through my fingers at CUDA C(++) code, I recoil in disbelief that we don't have high level programming yet on the GPU. I can't wait until we do. The more great projects attacking it the better, in my book.
i5heu 12 months ago
This is a very cool project! I sometimes dream about an instruction set architecture (ISA) that runs in some kind of VM and allows existing languages to run on CPUs/GPUs/FPGAs/ASICs automatically.

I think this is a much more practical approach, and I hope it will give some inspiration to this possibility.
runeks 12 months ago
What's the expected effort needed for supporting GPUs other than Nvidia — e.g. AMD GPUs or the GPU in a MacBook Pro M1/2/3?

As I understand it, it's a lot of work because there's no common way to target these different GPUs. Is this correctly understood?
croemer 12 months ago
Dupe of https://news.ycombinator.com/item?id=40387394 and https://news.ycombinator.com/item?id=40383196
temp123789246 12 months ago
Congrats!

I’ve been watching HVM for a while and think it’s extremely cool.

My intuition is that this will eventually be a really big deal.
britannio 12 months ago
Incredible feat, congratulations!
xiaodai 12 months ago
Looks cool, but what's one toy problem that it can solve more efficiently than others?
flakiness 12 months ago
This kind of sounds like Mojo. I wonder how these compare? (Besides HVM/Bend being open source, which is awesome.)

https://www.modular.com/max/mojo
zmmmmm 12 months ago
Reminds me a little bit of Concurnas [0], which sadly seemed to get abandoned right at the point when it was nearly viable.

[0] https://concurnas.com/
anonzzzies 12 months ago
What is the terminal used for that demo? https://github.com/HigherOrderCO/Bend Does it just skip commands it cannot execute, or?
shadowpho 12 months ago
Wow this is very impressive!
gigatexal 12 months ago
The first graphic midway or so down the page has this tag:

tested on: CPU - Apple M3 Max, GPU - NVIDIA RTX 4090

But how? I thought eGPUs don't work on Apple silicon, and the PCIe-having Mac Pro is still M2 based, no?
exitheone 12 months ago
This seems pretty cool!

Question: Does this take into account memory bandwidth and caches between cores? Because getting them wrong can easily make parallel programs slower than sequential ones.
KeplerBoy 12 months ago
What's going on with the super-linear speedup going from one thread to all 16?

210 seconds (3.5 minutes) to 10.5 seconds is a 20x speedup, which isn't really expected.
netbioserror 12 months ago
So HVM finally yields fruit. I've been eagerly awaiting this day! Bend seems like a very suitable candidate for a Lispy S-expression makeover.
mccoyb 12 months ago
This is cool! Is the idea to put Kind2 on top of this in some way?

I’d also love to find an example of writing a small interpreter in Bend - which runs on the GPU.
mattdesl 12 months ago
Looking forward to using this. Curious how far away WebGPU/WASM support might be; it could provide a single cross-platform backend.
funny_name 12 months ago
What kind of software would this language be good for? I assume it's not the kind of language you'd use for web servers, exactly.
zacksiri 12 months ago
What would be super interesting is having something like a library, so you could use something like this inside Elixir or Ruby to optimize hotspots.
anonzzzies about 1 year ago
This and HVM2 are some of the most interesting work I know of currently. A nice break from all the LLM stuff.

Now I just need a Common Lisp implemented using it!
idiomaxiom 12 months ago
If folds and bends are isomorphic to loops, then loops can be parallelized a la Occam-Pi?

I am really enjoying this implementation :)
programjames 12 months ago
I just read through the HVM2 and Lafont papers, and I'm pretty impressed with this style of computation!
thinking_banana 12 months ago
It's pretty cool that you *actually* built this much-needed developer interface aimed at making HPC accessible!
kerkeslager 12 months ago
This looks like the language I've wanted for a long time. I'm excited to see how this plays out.
KingOfCoders 12 months ago
Has someone written the example in "native" GPU code (C/CUDA) to compare performance?
totorovirus 12 months ago
I can already see how many people are so illiterate about GPUs.
DrNosferatu 12 months ago
Is there an OpenCL backend?
buckley_1991 about 1 year ago
Awesome! It looks similar to the problems that Modular AI aims to solve.
wolfspaw 12 months ago
Nice!

Python-like + high-performance.

And, different from Mojo, it's fully open source.
IsTom 12 months ago
Reminds me a lot of the reduceron, sans FPGA.
markush_ 12 months ago
Exciting project, congrats on the release!
kkukshtel 12 months ago
Honestly incredible, and congrats on the release after what looks like an insane amount of work.
toastal 12 months ago
Another project locking its communications to the Discord black hole.
naltroc 12 months ago
So cool. I see it has a lib target; can we use it as a crate instead of an external program?
abeppu 12 months ago
I for one found the 'how is this possible' video near the bottom of the page to be unhelpful:

- Surely for `3 x 3 = 9`, there is some concept of primitive operations?

- I get that replacement of patterns in a graph can be done in parallel, but (a) identifying when a rewrite rule should apply, (b) communicating the state of the updated graph to worker threads, and (c) organizing worker threads to agree on who does each task all take some effort. When is this more work than the original computation (as in the 3x3 example)?
chc4 12 months ago
24-bit integers and floats, no array datatype, and a maximum 4GB heap of nodes are very harsh restrictions, especially for any workloads that would actually want to be running on a GPU. The limitations in the HVM2 whitepaper about unsound evaluation around closures and infinite loops, because it evaluates both sides of a conditional without any short circuiting, are also extremely concerning.

Before you reply "these are things we can address in the future": that doesn't matter. Everyone can address everything in the future. They are currently hard technical barriers to its use, with no way of knowing the level of effort that will require or the knock-on effects, especially since some of these issues have been "we can fix that later" for ten years.

I also *highly* recommend changing your benchmark numbers from "interactions per second" to a standard measurement like FLOPS. No one else on earth knows how many of those interactions are pure overhead from your evaluation semantics and not doing useful work. They come across as attempting to wow an audience with high numbers and not communicating an apples-to-apples comparison with other languages.
efferifick 12 months ago
Thank you for sharing!
exabrial 12 months ago
> First, install Rust nightly.

Eeek.
JayShower 12 months ago
This is really cool!
jgarzon 12 months ago
Very nice!
hypersimplex 12 months ago
congratz on the release
gingfreecss 12 months ago
amazing
Archit3ch 12 months ago
Pure functions only? This is disappointing. Furthermore, it invites a comparison with JAX.
tinydev about 1 year ago
This is cool as shit.
3abiton 12 months ago
> That's a 57x speedup by doing nothing.

Okay, I'll have what you're having.
light_hue_1 12 months ago
Massive promises of amazing performance, but they can't find one convincing example to showcase. It's hard to see what they're bringing to the table when even the simplest possible Haskell code is just as fast on my 4 year old laptop with an ancient version of GHC (8.8). No need for an RTX 4090.

    module Main where

    sum' :: Int -> Int -> Int
    sum' 0 x = x
    sum' depth x = sum' (depth - 1) ((x * 2) + 0) + sum' (depth - 1) ((x * 2) + 1)

    main = print $ sum' 30 0

Runs in 2.5s. Sure it's not on a GPU, but it's faster! And things don't get much more high level.

If you're going to promise amazing performance from a high level language, I'd want to see a comparison against JAX.

It's an improvement over traditional interaction nets, sure! But interaction nets have always been a failure performance-wise. Interaction nets are the PL equivalent of genetic algorithms in ML: they sound like a cool idea and have a nice story, but then they always seem to be a dead end.

Interaction nets optimize parallelism at the cost of everything else, including single-threaded performance. You're just warming up the planet by wasting massive amounts of parallel GPU cores to do what a single CPU core could do more easily. They're just the wrong answer to this problem.