I'm really interested in Mojo not for its AI applications, but as an alternative to Julia for high-performance computing. Like Julia, Mojo is attempting to solve the two-language problem, but I like that Mojo is coming at it from a Python perspective rather than trying to create new syntax. For better or for worse, Python is absolutely dominating the field of scientific computing, and I don't see that changing anytime soon. Being able to write optimizations at a lower level in a Python-like syntax is really appealing to me.

Furthermore, while I love Julia the language, I'm disappointed in how it really hasn't taken off in adoption by either academia or industry. The community is small, and that becomes a real pain point when it comes to tooling. Using the debugger is an awful experience, and the VS Code extension that is the recommended way to write Julia is very hit-or-miss. I think it would really benefit from a lot more funding, which doesn't actually seem to be coming. It's not a 1-to-1 comparison, but Modular has received three times the funding of JuliaHub despite being much younger.
At least they included numpy in this one. On their last post, after all their optimizations, numpy.matmul() produced almost the exact same throughput as their most optimized example. Would still need to dig in to see if this one has issues. Benchmarks are always such a minefield.
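For what it's worth, the digging I mean is quick to do yourself. A minimal sketch of checking numpy.matmul throughput, assuming square float64 matrices (the size here is my guess, not necessarily what the post benchmarked):

```python
# Rough throughput check for numpy.matmul; matrix size is an assumption,
# not necessarily the one the article used.
import time
import numpy as np

n = 1024
a = np.random.rand(n, n)
b = np.random.rand(n, n)

np.matmul(a, b)                       # warm-up so BLAS threads are spun up
reps = 10
start = time.perf_counter()
for _ in range(reps):
    np.matmul(a, b)
elapsed = (time.perf_counter() - start) / reps

flops = 2 * n**3                      # multiply-adds in a dense n x n matmul
print(f"numpy.matmul: {flops / elapsed / 1e9:.1f} GFLOP/s")
```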
I'm pretty excited about Mojo and have been keeping an eye on its development. I feel like the team has learned a lot from their experience and is taking the best from languages like Python, Rust, Swift, and Hylo (formerly known as Val), implementing it with a really nice pragmatic approach so that the language is *approachable*, but also very safe and fast. Once it's out, I hope someone sits down and makes a SwiftUI-like cross-platform UI library with it ;).
A 35Kx speedup is not scaled speedup: throw this naively parallelizable task at a bigger computer and get a 70Kx speedup, etc.

While I think there are tons of optimizations to be done for Python (looking at you, GIL), giving access to low-level CPU primitives is not one I think will be broadly adopted by the Python community. That's one of the joys of Python: system-agnostic coding that looks pretty close to pseudocode. If you want speed, glue together a bunch of compiled code calls and hope the call overhead isn't too large, or write CPU-intensive operations in Numba or Pyrex. At the end of the day, Mojo's pay-to-play programming language harkens back to the early-'90s Borland days.
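To be concrete about the Numba route, here's a sketch of the existing workflow I mean, with a made-up numeric kernel (assumes numba is installed and the hot loop is numeric):

```python
# The usual "keep it looking like pseudocode" route: let Numba compile the
# hot loop. Sketch only; the kernel and sizes are illustrative.
import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def row_sums(x):
    out = np.empty(x.shape[0])
    for i in prange(x.shape[0]):      # parallelized across rows
        s = 0.0
        for j in range(x.shape[1]):
            s += x[i, j]
        out[i] = s
    return out

x = np.random.rand(4096, 4096)
print(row_sums(x)[:3])                # first call pays the JIT compile cost
```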
Mojo needs to demonstrate Hugging Face's AI libraries with Mojo acceleration. Nothing else will have the kind of impact that would have.

Throw a half dozen engineers at it, develop a deployment plan for SD XL, profit.

You'll get a ton of open source developers working on improving the Mojo versions even further once you release it, researchers developing extensions, etc. GO TO WHERE THE DEVELOPERS ARE.

Stable Diffusion is crazy compute-heavy, so if Mojo is what it's purported to be, it should be possible to get speedups.
I don't understand the play here for Modular. If this is a worthwhile improvement that is broadly applicable, won't it at some point make its way into Python, NumPy, etc.?

In Java land we had a bunch of other JVMs over the years offering better performance. The most important things got absorbed into what is now OpenJDK, and the other JVMs, if they even exist at all, are niche players.

Performance is a huge focus in Python and ML lands right now, so why would this be any different?
Cool, but it has very little to do with Python, except some similar-looking syntax.

So for a Python programmer with a performance problem, it doesn't look like a solution.
I just want to see real, un-hyped benchmarks. Comparing against arbitrary native Python code makes no sense and seems dishonest, which deters me from actually trying out the tool.

I want a Python that can statically plan underlying GPU allocations, avoid CUDA kernel dispatch overhead, and enable a multi-GPU API that isn't some multiprocessing abomination.
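On the dispatch-overhead point, the closest thing you can do today is capture the kernel sequence up front and replay it. A minimal sketch using PyTorch's CUDA graphs (assumes a recent PyTorch build with CUDA and static shapes; the model and sizes are made up):

```python
# Sketch of replaying a captured kernel sequence to cut per-kernel launch
# overhead. Requires CUDA and static input shapes; model/sizes are illustrative.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_in = torch.randn(64, 1024, device="cuda")

# Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

static_in.copy_(torch.randn(64, 1024, device="cuda"))
g.replay()                            # one launch replays the whole sequence
print(static_out.shape)
```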
As a high-performance computing person, I'm usually I/O-bound, not compute-bound. I wish someone would come up with a 10x speedup for disk and network I/O.
So TL;DR: using SIMD and multithreading is faster than doing no optimization in Python. The only real comparison here is the one that isn't against completely unoptimized code:

> The above code produced a 90x speedup over Python and a 15x speedup over NumPy as shown in the figure below:

Am I missing something?
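To put a number on why the headline multiplier mostly measures the baseline rather than the optimization, here's a toy comparison (sizes are arbitrary; it's only meant to show how slow pure-Python loops are relative to NumPy):

```python
# Toy illustration: a large chunk of any "Nx over Python" headline comes from
# how slow the pure-Python baseline is, not from anything exotic.
import time
import numpy as np

n = 256
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float(j * n + i) for j in range(n)] for i in range(n)]

def py_matmul(a, b):
    m = len(a)
    c = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for k in range(m):
            aik = a[i][k]
            for j in range(m):
                c[i][j] += aik * b[k][j]
    return c

t0 = time.perf_counter(); py_matmul(a, b); t_py = time.perf_counter() - t0
an, bn = np.array(a), np.array(b)
t0 = time.perf_counter(); an @ bn; t_np = time.perf_counter() - t0
print(f"pure Python is ~{t_py / t_np:.0f}x slower than NumPy here")
```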
I don’t understand this from a goals perspective. What is an “AI compiler”, and why aren’t they comparing benchmarks against technologies more commonly used in AI?

I think I should be impressed, but I feel like I’m missing the point.