> Simply put, it is nicer to build your systems so that, as much as possible, they use a constant amount of memory irrespective of the input size

Really good advice - this is a hard-earned lesson for many folks. I've worked with quite a few data scientists who were brilliant at experimental design but not necessarily experts in the field of comp sci. Their relatively simple Python scripts would run nice and fast initially. As time passed and the organization grew, their scripts would start to run slower and slower as the datasets scaled and swapping to disk started occurring, etc. In some cases they would completely lock up shared machines, taking a good chunk of the team offline for a bit.

Anyway, Daniel Lemire's blog is a fantastic resource. I highly recommend taking a look through his publications and open source contributions. I was able to save my employer *a lot* of money by building on time series compression algorithms [1] and vectorized implementations [2][3] that he has provided.

[1] Decoding billions of integers per second through vectorization, https://onlinelibrary.wiley.com/doi/full/10.1002/spe.2203

[2] https://github.com/lemire/FastPFor

[3] https://github.com/searchivarius/PyFastPFor
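To make the "constant amount of memory" point concrete, here is a minimal sketch of the streaming style (the file name and column are made up): keep a running aggregate instead of loading everything first.

    import csv

    def mean_value(path, column="value"):
        # Stream the CSV row by row and keep a running mean.
        # Memory use stays constant no matter how large the file grows,
        # unlike loading the whole file into a list or a DataFrame first.
        total = 0.0
        count = 0
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += float(row[column])
                count += 1
        return total / count if count else float("nan")

    print(mean_value("measurements.csv"))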
I didn't find the solutions to be all that helpful; it was essentially "pipeline your processing." If all I'm doing is a programmatic "cat | grep | cut", that's fine. It gets messy when I want to sort the data or join it with another medium-large dataset, and this is why people load into memory in the first place. Is there a sane path where I'm not immediately jumping to Hadoop or Presto? Maybe I can offload the join to SQLite?
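Something like the sketch below is what I have in mind (untested, with a made-up schema): let an on-disk SQLite file do the join and the sort so neither dataset has to sit in Python memory.

    import csv
    import sqlite3

    # An on-disk database, so the join/sort can spill to disk instead of RAM.
    con = sqlite3.connect("scratch.db")
    con.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, ts TEXT, payload TEXT)")
    con.execute("CREATE TABLE IF NOT EXISTS users (user_id TEXT, country TEXT)")

    def load(table, path, ncols):
        with open(path, newline="") as f:
            rows = csv.reader(f)
            next(rows, None)  # assuming a header row; drop this line if there is none
            placeholders = ",".join("?" * ncols)
            con.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

    load("events", "events.csv", 3)
    load("users", "users.csv", 2)
    con.execute("CREATE INDEX IF NOT EXISTS idx_events_user ON events(user_id)")
    con.commit()

    # The join and ORDER BY run inside SQLite; Python only sees one row at a time.
    query = ("SELECT e.ts, e.payload, u.country "
             "FROM events e JOIN users u USING (user_id) "
             "ORDER BY e.ts")
    for row in con.execute(query):
        print(row)  # stand-in for the real per-row processing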
> To be fair, if the rest of your pipeline runs in the megabytes per second, then memory allocation might as well be free from a speed point of view.

This is important, but I think sometimes people struggle with this sort of thinking. I struggled to explain a similar concept to a junior engineer recently. He was very keen to optimize part of a process that wasn't the bottleneck. I tried a couple of approaches, like benchmarking various parts under different conditions and modeling how much speeding up each component would improve the total runtime.

I wasn't convincing, unfortunately, so he implemented some changes that successfully sped up one part but didn't improve end-to-end performance. I think sometimes you need to see it with your own eyes.
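A toy version of the kind of model I was trying to show him, with made-up per-record stage times and assuming the stages run one after another:

    # Hypothetical per-record stage times (seconds) for a three-stage pipeline.
    stages = {"parse": 0.002, "transform": 0.001, "write_db": 0.050}
    baseline = sum(stages.values())

    # Make the non-bottleneck "transform" stage 10x faster: barely moves the needle.
    faster_transform = dict(stages, transform=stages["transform"] / 10)
    print(f"{baseline / sum(faster_transform.values()):.2f}x")  # ~1.02x end to end

    # Make the actual bottleneck "write_db" 10x faster instead: a real win.
    faster_db = dict(stages, write_db=stages["write_db"] / 10)
    print(f"{baseline / sum(faster_db.values()):.2f}x")  # ~6.6x end to end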
I'd add: every time you read the data, do something so that next time it's easier. Using Java and pcapng files as an example: the first time you read your data, you should at least build a simple block/packet index so that next time you won't have to read it all again. Same for all kinds of sparse data. I've had great success using 'simple' things like roaring bitmaps as 'field x is present in block y of stream z' indexes. I save the compressed bitmap(s) in a SQLite DB and the next time I open the data file I use them. This can be grafted on quite quickly.

I've realized over the years that the 'load everything' habit is often linked to a lack of understanding of machine limitations and little training in stream processing and how efficient and scalable it is.

I'd blame maths teaching that focuses on the abstract operation (the full formula) and rarely covers implementation beyond simple examples. Or I'd blame Excel as the gateway drug to numerical computing. But mostly, it's probably 'just' that not that many people happen to encounter data 'that' big (yet it's not 'big' data), and when they do they're often not helped in finding 'progressive' solutions. A running variance/average isn't hard to understand, but you must know it exists... Basic stream processing can be achieved without too many changes (depending on the algorithms, of course). Simple indexes can be quite easy to build... But often we sell them 'you need to go full DB' or 'this is a job for Hadoop or infrastructure-nightmare-tool-of-the-day'. Not everyone points you to https://ujmp.org/ with its sparse/dense 1d/2d/3d matrix structures and operations and different storage options (disk-backed, etc.).

Most of the time, when I meet data scientists in difficulty, after an hour of explaining how I do X using roaring bitmaps or sparse structures, or an hour spent building a file/field index using very robust (old) libraries in their language/environment of choice, I see them build pretty solid and scalable pipelines...
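A rough sketch of the 'field x is present in block y' index described above, here in Python with the pyroaring library rather than the Java RoaringBitmap implementation (the block reader is assumed; plug in whatever fits the file format):

    import sqlite3
    from pyroaring import BitMap

    con = sqlite3.connect("index.db")
    con.execute("CREATE TABLE IF NOT EXISTS field_index (field TEXT PRIMARY KEY, blocks BLOB)")

    def build_index(blocks):
        # First full pass over the file: remember which blocks contain which field.
        # `blocks` is assumed to yield (block_number, set_of_field_names) pairs,
        # produced by whatever block reader fits the format (pcapng, log chunks, ...).
        index = {}
        for block_no, fields in blocks:
            for field in fields:
                index.setdefault(field, BitMap()).add(block_no)
        for field, bm in index.items():
            con.execute("INSERT OR REPLACE INTO field_index VALUES (?, ?)",
                        (field, bm.serialize()))
        con.commit()

    def blocks_with(field):
        # Later runs: look up which blocks to read and skip everything else.
        row = con.execute("SELECT blocks FROM field_index WHERE field = ?",
                          (field,)).fetchone()
        return BitMap.deserialize(row[0]) if row else BitMap()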
At work we batch-process a decent amount of data every night using Spark. The way the workload can be split into parts allows the following strategy:

- Generate jobs that only contain metadata and distribute those jobs to workers. There, the actual data required for that specific job is queried on the fly from the DB.

For a similar task, however, we later switched to the following strategy:

- Load all data to be processed at once, use all sorts of distributed transformations and aggregations on the full data set, and do the splitting at the end of the workflow.

The reason why we switched? The only real reason was that it seemed more like the "Big Data" way to do things. With the first approach we wouldn't actually need all the fancy Spark functionality, right? We would only be abusing the framework as a fancy way to distribute mostly independent workloads.

However, I very much regretted that decision later, as it made our life harder in many ways. For example, I could easily execute and thus debug one of the former jobs locally within a minute. Try that when the workflow is designed in a way that it needs to load several gigabytes before applying any logic. To be fair, the total load on the DB was somewhat lower with the second approach, but that just wasn't worth it.
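A rough sketch of the first strategy (the job fields and DB helpers are made up, not our actual code): only lightweight job descriptions get shipped to the cluster, and each worker pulls just the rows it needs.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("metadata-jobs").getOrCreate()
    sc = spark.sparkContext

    # Lightweight job descriptions; the heavy data stays in the DB until a worker asks for it.
    jobs = [{"customer_id": i, "day": "2024-01-01"} for i in range(10_000)]

    def fetch_from_db(customer_id, day):
        # Hypothetical stand-in for the real per-job query (JDBC, REST, ...).
        return [{"customer_id": customer_id, "day": day, "amount": 1.0}]

    def summarize(rows):
        # Hypothetical per-job aggregation.
        return sum(r["amount"] for r in rows)

    def process_job(job):
        # Runs on a worker: fetch only this job's data, process it, return a small result.
        rows = fetch_from_db(job["customer_id"], job["day"])
        return job["customer_id"], summarize(rows)

    results = sc.parallelize(jobs, numSlices=200).map(process_job).collect()

Since each job is independent, any single one can also be run in a plain local Python process for debugging.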
This is nothing new. Back in the late 2000s, I did this to deal with a huge quantity of XML being sent over the wire to desktop PCs. The original version needed 20GB of RAM. I changed it to pick up only the tags needed and parse the stream as it came in. Time was massively reduced too.

I see the same mistakes being made with JSON nowadays.

Basically, if you don't need the DOM in its entirety, don't parse it all.
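In Python, for example, the equivalent streaming approach looks roughly like this (made-up file and tag names), using iterparse so only the elements of interest are ever materialized:

    import xml.etree.ElementTree as ET

    total = 0
    # iterparse walks the document as a stream instead of building the whole DOM.
    for event, elem in ET.iterparse("huge_export.xml", events=("end",)):
        if elem.tag == "record":                       # hypothetical tag of interest
            total += int(elem.findtext("value", default="0"))
            elem.clear()                               # drop this record's subtree once processed
    print(total)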
This is, in my experience, the most common performance antipattern. What makes it even worse is that people quite often do this in the name of performance, which is rarely a good idea.