This brings me back to when I worked on Excel. The formatting structs (and most cell-related ones) are so tightly packed, and the bytecode VM for the formulas is pretty neat.<p>I wish I could have met duanec; that guy probably wrote 10% of the code for Excel. Sadly, he had retired two years before I started working on Excel.<p>Fun fact: when they were about to ship Excel 1.0, they didn’t know what to name it, so they held an internal poll. The name that garnered the most votes was “Mr Spreadsheet”. Needless to say, marketing vetoed it.
If you're interested in multi-dimensional calculation engines I can also highly recommend this documentary: <a href="https://tm1.film/" rel="nofollow">https://tm1.film/</a><p>Companies like Amazon are still using TM1 for their forecasting. We're trying to build a 21st century version at <a href="https://causal.app" rel="nofollow">https://causal.app</a>
I work as a consultant and spend five out of eight hours a day in Excel, and I want the features in this app so bad.<p>Sadly, the only way I’ll be able to use these features is if Microsoft buys them out and integrates them into Excel. It’s ubiquitous at almost every workplace you go to, but the only innovations of the past 20 years seem to be Power Query (which is awesome) and XLOOKUP.
Interesting. A few q's as an Excel performance nerd who's fooled around in Pandas:<p>* Why the design choice to tie the calc engine directly to a DB, versus computing dependencies and returning values after the fact?
* How does vectorization scale when using "volatiles" like OFFSET-style formulas that may dynamically change the calc graph?<p>Every time I tried to recreate Excel models in Pandas/NumPy, 1:1 multiplication wasn't the issue; the weird dependencies that prevented using their vectorized capabilities were.
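To make that vectorization point concrete, here's a minimal NumPy sketch (my own illustration, not from the article): element-wise formulas map to single vectorized kernels, but an OFFSET-style formula whose cells read the previous result forces a serial Python loop:

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# "1:1" element-wise formulas vectorize trivially: one kernel, no loop.
c = a * b

# An OFFSET-style formula where each cell reads the previous *result*
# creates a serial dependency chain that plain ufuncs can't express:
out = np.empty_like(a)
out[0] = a[0]
for i in range(1, len(a)):
    out[i] = out[i - 1] * 0.9 + a[i]

# Some dependency chains do have vectorized forms, e.g. running totals:
running = np.cumsum(a)
```

The serial loop is exactly where Pandas/NumPy stop helping; an engine that analyzes the dependency graph can at least batch the independent parts.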
Curious why they opted for Go rather than C++ for the core engine, where there are lots of batched mathematical calculations.<p>If you eventually want to leverage SIMD and GPGPU computation, C++ is almost mandatory; languages like Rust/Nim/Zig are getting there but are still behind in many areas.
Can you elaborate on some of the use cases for spreadsheets having billions of rows? Databases of phone calls and credit card transactions reach this size, but I would never imagine someone wanting to dump any significant amount of transaction data into a spreadsheet without quite a bit of SQL/R/Python preprocessing ahead of time.
What is your strategy for storage? It seems that because you're offering this as SaaS, your customers both set a high bar for durability of their data and your COGS would be very high if you just kept a sheet this big in memory all the time.<p>Do you have any plans for models which exceed the size that can be reasonably processed on one machine?
Really interesting write up of the performance improvements. I love reading about such huge improvements by paying attention to algorithms and data structures.<p>You have inspired me to try replacing one of the hash tables in an algorithm I'm working on with an array to see what happens. It won't be exactly the same algorithm afterwards, but possibly the benefits will outweigh the disadvantages.
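For what it's worth, a toy version of that swap (my own sketch, assuming dense integer keys) looks like this; the array only wins because the key space is small and dense:

```python
import random

N = 100_000
keys = [random.randrange(4096) for _ in range(N)]  # small, dense key space

# Hash-table version: every lookup hashes the key and probes buckets.
counts_dict = {}
for k in keys:
    counts_dict[k] = counts_dict.get(k, 0) + 1

# Array version: only valid because keys are small dense ints, but each
# access is a direct offset with no hashing and better cache locality.
counts_arr = [0] * 4096
for k in keys:
    counts_arr[k] += 1

assert all(counts_arr[k] == v for k, v in counts_dict.items())
```

The trade-off is the one you suspect: you give up arbitrary keys and pay for empty slots, so it's only a win when the key space is bounded.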
> However, we can create serious bottlenecks when it comes to moving memory back and forth between the CPU and GPU. We’d have to learn what kinds of models would take advantage of this, we have little experience with GPUs as a team—but we may be able to utilize lots of existing ND-array implementations used for training neural nets.<p>Indeed, moving memory back and forth between CPUs and GPUs has an overhead. There are ways to mitigate this, though! I vaguely remember that one of the key patterns for reducing this movement is to keep the data on the GPU as much as possible. I haven't kept up with GPU tech of late, but when I first played around with CUDA, ArrayFire (arrayfire.com) (no affiliation) was a promising library; it might be a good fit for your GPU prototypes.
Wait, where is the actual spreadsheet? Everything is hidden behind demo booking.<p>It'd be nice to actually see billions of cells in action. Otherwise it's just marketing for their product disguised as a technical blog post. For all we know it doesn't actually work that well in practice.
Man, this is exactly what I did at a previous job. We had a massive query engine as part of our BI product. I recoded the entire sorting algorithm to use variadic templates, producing a sorter for any set of column types in one go rather than dereferencing a pointer for each column. We observed massive speed improvements (more like 50% than the 100x mentioned here, but that codebase had been constantly optimized over the preceding 25 years).
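A rough Python analogue of the idea (my sketch, nothing like the actual template machinery): fuse the whole column set into one composite sort key instead of doing per-column passes with indirection:

```python
from operator import itemgetter

rows = [
    ("widgets", 2021, 3.5),
    ("gadgets", 2020, 1.2),
    ("widgets", 2020, 9.9),
]

# Naive analogue of per-column indirection: one stable sort pass per column.
by_cols_slow = sorted(sorted(rows, key=itemgetter(1)), key=itemgetter(0))

# Fused key: one comparator over the whole column set, built once --
# roughly what the variadic-template sorter achieves at compile time.
by_cols_fast = sorted(rows, key=itemgetter(0, 1))

assert by_cols_slow == by_cols_fast
```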
OMG, does anyone else recall a series of comedy videos about a sysadmin dealing with employees? My favorite was a call from someone who "ran out of rows" in Excel. The sysadmin is losing his mind trying to explain why that wasn't possible.<p>EDIT: Found it! <a href="https://youtu.be/1SNxaJlicEU" rel="nofollow">https://youtu.be/1SNxaJlicEU</a> (slightly NSFW, skip to about 3:30)
I’m curious how incremental updates fit into this. I guess there are relatively structured updates, like adding a row, but one can imagine unstructured ones, like updating a parameter that many formulas depend on, or (not sure how much Causal allows this) much more varied structures/dependencies, such that one update may require recalculating a small but hard-to-predict subset of cells. Maybe there’s no need if you can recalculate everything fast.<p>I’m also interested in how Causal fits into the taxonomy here: <a href="https://www.scattered-thoughts.net/writing/an-opinionated-map-of-incremental-and-streaming-systems/" rel="nofollow">https://www.scattered-thoughts.net/writing/an-opinionated-ma...</a>
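For anyone curious what the structured case looks like, here's a toy dirty-set recompute over a dependency graph (hypothetical cell names and formulas, not Causal's actual design):

```python
from collections import defaultdict

formulas = {
    "b": lambda v: v["a"] * 2,
    "c": lambda v: v["a"] + 1,
    "d": lambda v: v["b"] + v["c"],
}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

# Invert the graph: which cells read from each cell?
dependents = defaultdict(set)
for cell, reads in deps.items():
    for r in reads:
        dependents[r].add(cell)

def recalc(values, changed):
    """Recompute only the cells transitively downstream of `changed`."""
    dirty, stack = set(), [changed]
    while stack:
        for dep in dependents[stack.pop()]:
            if dep not in dirty:
                dirty.add(dep)
                stack.append(dep)
    # Evaluate dirty cells in dependency order (deps before dependents).
    while dirty:
        ready = {c for c in dirty if not (deps[c] & dirty)}
        for c in ready:
            values[c] = formulas[c](values)
        dirty -= ready
    return values

values = recalc({"a": 10}, "a")  # b = 20, c = 11, d = 31
```

The hard part the sketch skips is exactly what you describe: when the graph itself changes under an edit, the dirty set becomes hard to predict up front.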
Suggestions for further improvement:<p>Cap the number of cells and essentially partition. Using two machines could conceivably double performance once scaling up stops working. So a 1bn-row sum could conceivably ask for 5 200m sums, then sum those.<p>Maintain a float32 and a float64 version. Compute the float32 first and return it to the user, then update the result.<p>You could conceivably run some kind of ML model, or even regular logic, to predict the API call your formula engine will need to make; e.g. if the customer starts typing a sum over a column range, you could optimistically fetch that part as soon as you see that element.<p>Optimistic caching is obviously high-variance and resource-draining.<p>The 10,000-IQ version is coordination-free horizontal scaling, ScyllaDB-style.
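The partitioned-sum idea can be sketched in a few lines (threads standing in for machines; purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def partitioned_sum(xs, parts=5):
    """Split one big sum into `parts` partial sums, then sum the partials."""
    n = len(xs)
    bounds = [(i * n // parts, (i + 1) * n // parts) for i in range(parts)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = pool.map(lambda b: sum(xs[b[0]:b[1]]), bounds)
    return sum(partials)

xs = list(range(1_000_000))
assert partitioned_sum(xs) == sum(xs)
```

With real machines the partials would come back over the network, so the float32-first trick pairs nicely: ship the cheap approximate answer as soon as the first round trip completes.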
Is there some kind of standard / open-source specification for "formulas"?<p>I see this concept in a few applications and am wondering if I'd need to re-invent my own.
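There is one: OpenFormula, part of the OASIS OpenDocument (ODF 1.2) standard, specifies spreadsheet formula syntax and function semantics, and it's roughly what LibreOffice implements. If you only need arithmetic, though, you can get surprisingly far by evaluating a whitelisted subset of Python's own AST (a sketch, not a standard):

```python
import ast
import operator as op

# Whitelist of allowed operations; anything else is rejected.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
       ast.Div: op.truediv, ast.USub: op.neg, ast.Pow: op.pow}

def eval_formula(src, cells):
    """Evaluate e.g. "A1 * 2 + B2" against a dict of cell values."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):  # treat bare names as cell references
            return cells[node.id]
        raise ValueError(f"unsupported syntax: {node!r}")
    return walk(ast.parse(src, mode="eval"))

assert eval_formula("A1 * 2 + B2", {"A1": 10, "B2": 5}) == 25
```

Once you need ranges, SUM/IF-style functions, or Excel-compatible semantics, following OpenFormula saves re-inventing a lot of edge cases.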
This looks really cool! I've been reading quite a bit about spreadsheet/calculation engines while building my own as a hobby project recently, and the whole space is pretty interesting.
How do you manage memory on the backend? If you have single spreadsheets taking up GBs it can add up pretty fast. What are your thoughts on scaling this as the business grows?
Can someone please point me to a similar display/UI/grid engine used in online word processors/spreadsheets? It'd be cool to play with it.