Scaling our spreadsheet engine from thousands to billions of cells

323 点作者 Lukas1994将近 3 年前

23 条评论

This brings me back to when I worked on Excel. The formatting structs (and most cell related ones) are so tightly packed and the byte code VM for the formulas is pretty neat.I wish I could have met duanec, that guy wrote probably 10% of the code for Excel. Sadly he had retired 2 years before I started working on Excel.Fun fact: when they were about to ship Excel 1.0, they didn’t know what to name it then so they had an internal poll. The name that garnered the most votes was “Mr Spreadsheet”. Needless to say, marketing vetoed it.

评论 #32007301 未加载

评论 #32003735 未加载

评论 #32004597 未加载

评论 #32009727 未加载

评论 #32003289 未加载

评论 #32004423 未加载

评论 #32007033 未加载

评论 #32005744 未加载

评论 #32002817 未加载

评论 #32003480 未加载

Lukas1994将近 3 年前

If you're interested in multi-dimensional calculation engines I can also highly recommend this documentary: <a href="https://tm1.film/" rel="nofollow">https://tm1.film/</a>Companies like Amazon are still using TM1 for their forecasting. We're trying to build a 21st century version at <a href="https://causal.app" rel="nofollow">https://causal.app</a>

评论 #32003214 未加载

greggsy将近 3 年前

I work as a consultant and spend five out of eight hours a day in Excel, and I want the features in this app so bad.Sadly, the only way I’ll be able to use these features is if Microsoft bought them out and integrated them into Excel. It’s ubiquitous at almost every workplace you go to, but the only innovation of the past 20 years seems to be PQ (which is awesome), and xlookup.

评论 #32001411 未加载

评论 #32003124 未加载

评论 #32004338 未加载

评论 #32004076 未加载

评论 #32004007 未加载

评论 #32001469 未加载

评论 #32002139 未加载

theolddirty将近 3 年前

Interesting. A few q's as an Excel performance nerd who's fooled around in Pandas:* Why the design choice to tie the calc engine directly to a DB vs computing dependencies and returning values post-facto? * How does vectorization scale when using "volatiles" like offset-style formulas that may dynamically change the calc graph?Every time I tried to recreate stuff in Excel in Pandas/Numpy, 1:1 multiplication wasn't the issue, the weird dependencies that prevented using their vectorized capabilities were.

cyber_kinetist将近 3 年前

Curious to think why they opted for Go and not C++ for the core engine, where there are lots of batched mathematical calculations.If you want to eventually leverage SIMD and GPGPU computation C++ is almost mandatory, languages like Rust/Nim/Zig are getting there but still behind in many areas.

评论 #32001745 未加载

评论 #32001664 未加载

评论 #32002418 未加载

mirntyfirty将近 3 年前

Can you elaborate some of the use cases for spreadsheets having billions of rows? Databases of phone calls and credit card transactions reach this size, but I would never imagine someone wanting to dump any significant amount of transaction data into a spreadsheet without quite a bit of SQL/R/Py modification ahead of time.

评论 #32002697 未加载

评论 #32003110 未加载

ryanworl将近 3 年前

What is your strategy for storage? It seems that because you're offering this as SaaS, your customers both set a high bar for durability of their data and your COGS would be very high if you just kept a sheet this big in memory all the time.Do you have any plans for models which exceed the size that can be reasonably processed on one machine?

评论 #32001514 未加载

MattPalmer1086将近 3 年前

Really interesting write up of the performance improvements. I love reading about such huge improvements by paying attention to algorithms and data structures.You have inspired me to try replacing one of the hash tables in an algorithm I'm working on with an array to see what happens. It won't be exactly the same algorithm afterwards, but possibly the benefits will outweigh the disadvantages.

difflens将近 3 年前

> However, we can create serious bottlenecks when it comes to moving memory back and forth between the CPU and GPU. We’d have to learn what kinds of models would take advantage of this, we have little experience with GPUs as a team—but we may be able to utilize lots of existing ND-array implementations used for training neural nets.Indeed, moving memory back and forth from CPUs to GPUs has an overhead. There are ways to mitigate this though! I vaguely remember that one of the patterns in reducing this movement was to keep the data in the GPU as much as possible. I haven't kept up with the latest tech in GPUs off late. When I first played around with CUDA, ArrayFire (arrayfire.com) (no affiliation) was a promising library, and might be a good fit for your GPU prototypes?

0xFACEFEED将近 3 年前

Wait, where is the actual spreadsheet? Everything is hidden behind demo booking.It'd be nice to actually see billions of cells in action. Otherwise it's just marketing for their product disguised as a technical blog post. For all we know it doesn't actually work that well in practice.

applgo443将近 3 年前

Man, this is exactly what I did at a previous job. We had a massive Query engine as a part of our BI product. I've recoded the entire sorting algorithm to use variadic templates to get a sorter for any set of column types in one go, rather than dereferencing a pointer for each column. We've observed massive improvements in speed (more like 50%, rather than 100x mentioned here. However, that codebase has been constantly optimized over the last 25 years).

iKnowKungFoo将近 3 年前

OMG, does anyone else recall a series of comedy videos about a sysadmin dealing with employees? My favorite was a call from someone who "ran out of rows" in Excel. The sysadmin is losing his mind trying to explain why that wasn't possible.EDIT: Found it! <a href="https://youtu.be/1SNxaJlicEU" rel="nofollow">https://youtu.be/1SNxaJlicEU</a> (slightly NSFW, skip to about 3:30)

sirjaz将近 3 年前

This needs to be able to be run locally/natively. Most companies would not want to run this with their proprietary data in a third party SaaS

评论 #32004127 未加载

dan-robertson将近 3 年前

I’m curious how incremental updates fit into this? I guess there are relatively structured possible updates like adding a row but one can imagine unstructured updates like updating some parameter that many formulas depend on or (not sure how much causal allows this) much more varied structures/dependencies such that one update may require calculating some small but hard to predict subset of cells. Maybe there’s no need if you can recalculate everything fast.I’m also interested in how causal fits into the taxonomy here: <a href="https://www.scattered-thoughts.net/writing/an-opinionated-map-of-incremental-and-streaming-systems/" rel="nofollow">https://www.scattered-thoughts.net/writing/an-opinionated-ma...</a>

评论 #32015139 未加载

bluelightning2k将近 3 年前

Suggestions for further improvement:Cap the amount of cells and essentially partition. Using 2 machines could conceivably double performance after scaling up stops working. So a 1bn row sum could conceivably ask for 5 200m sums then sum those.Maintain a float32 and float64 version. Compute the float32 first and return to user, then update the result.You could conceivably run some kind of ml model or even regular logic to predict the API call your formula engine will need to make, eg. If the customer status typing a column of sum range, you could optimistically fetch that part as soon as you see that element.Optimistically caching is obviously high variance and resource draining.The 10000iq version is coordination free horizontal scaling with scyladb algorithm

thdxr将近 3 年前

Is there some kind of standard / open source specification for "formulas"I see this concept in a few applications and wondering I'd need to re-invent my own

cyanydeez将近 3 年前

I was just looking at a GIS inside a spreadsheet idea. Have you guys looked into spatial awareness yet?

spankalee将近 3 年前

Most of those billions of values are not shown... why not compute only the values that are on screen?

评论 #32001813 未加载

adamrezich将近 3 年前

this looks really cool! I've been reading quite a bit about spreadsheet/calculation engines as I've been building my own as a hobby project recently, and the whole space is pretty interesting.

评论 #32008261 未加载

评论 #32002506 未加载

评论 #32003068 未加载

potamic将近 3 年前

How do you manage memory on the backend? If you have single spreadsheets taking up GBs it can add up pretty fast. What are your thoughts on scaling this as the business grows?

mandeepj将近 3 年前

can someone please point to a similar display/UI/grid engine used in online word processor/spreadsheets? It'll be cool to play with it

评论 #32003035 未加载

dexwiz将近 3 年前

Looks very similar to ECS-style data storage used in video games. Have you looked at using opaque index-based IDs instead of references?

评论 #32007935 未加载

smm11将近 3 年前

I'm pretty sure this is better addressed with SQL.

评论 #32005563 未加载