I'd add: every time you read the data, do something so that the next time is easier. Using Java and pcapng files as an example, the first time you read your data you should at least build a simple block/packet index so that next time you won't have to scan the whole file again. The same goes for all kinds of sparse data. I've had great success using 'simple' things like RoaringBitmaps as 'field X is present in block Y of stream Z' indexes. I save the compressed bitmap(s) in a SQLite DB and use them the next time I open the data file. This can be grafted on quite quickly.

I've realized over the years that the 'load everything' habit is often linked to a lack of understanding of machine limitations and to little training in stream processing and in how efficient and scalable it is.

I'd blame maths teaching that focuses on the abstract operation (the full formula) rather than on its 'implementation' past simple examples. Or I'd blame Excel as the gateway drug to numerical computing. But mostly it's probably 'just' that not that many people happen to encounter data 'that' big (yet it's not 'big' data), and when they do, they're often not helped in finding 'progressive' solutions. A running variance/average isn't hard to understand, but you have to know it exists... Basic stream processing can be achieved without too many changes (depending on the algorithm, of course). Simple indexes can be quite easy to build... But often we sell people 'you need to go full DB' or 'this is a job for Hadoop or infrastructure-nightmare-tool-of-the-day'. Not everyone points you to https://ujmp.org/, with its sparse/dense 1D/2D/3D matrix structures and operations and different storage options (disk-backed, etc.).

Most of the time when I meet data scientists in difficulty, after an hour of explaining how I do X using Roaring bitmaps or sparse structures, or an hour spent building a file/field index using very robust (old) libraries in their language/environment of choice, I see them build pretty solid and scalable pipelines...
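To make the block/packet index concrete, here is a minimal first-pass sketch over a pcapng file. It only records where each block starts so later runs can seek straight to block N instead of re-reading everything. It assumes a little-endian capture and untruncated blocks (a real parser would check the Section Header Block's byte-order magic); the class name is just for the example.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.util.ArrayList;
    import java.util.List;

    // First pass over a pcapng file: remember the byte offset of every block so
    // later runs can seek directly instead of scanning the whole capture again.
    // Assumes a little-endian capture; a real parser would check the Section
    // Header Block's byte-order magic (0x1A2B3C4D).
    public class BlockIndexer {
        public static List<Long> indexBlocks(String path) throws IOException {
            List<Long> offsets = new ArrayList<>();
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                long pos = 0, len = raf.length();
                byte[] header = new byte[8];       // block type + block total length
                while (pos + 8 <= len) {
                    raf.seek(pos);
                    raf.readFully(header);
                    int blockLen = ByteBuffer.wrap(header, 4, 4)
                                             .order(ByteOrder.LITTLE_ENDIAN).getInt();
                    if (blockLen < 12 || pos + blockLen > len) break;   // corrupt or truncated
                    offsets.add(pos);
                    pos += blockLen;               // total length includes header and trailer
                }
            }
            return offsets;
        }
    }

The offsets (plus per-block packet counts, timestamps, whatever you need) can go into the same SQLite file as the bitmaps, so the second open of a multi-gigabyte capture costs one small query instead of a full scan.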
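And a sketch of the 'field X is present in block Y' idea: one RoaringBitmap per field, with block numbers as the bits, serialized into a BLOB column in SQLite. It assumes the org.roaringbitmap:RoaringBitmap and org.xerial:sqlite-jdbc dependencies; the table and field names are made up for the example.

    import java.io.*;
    import java.sql.*;
    import org.roaringbitmap.RoaringBitmap;

    // One compressed bitmap per field name; set bit N means "field is present in block N".
    // The bitmap is stored as a BLOB in a small SQLite file next to the data.
    public class PresenceIndex {

        public static void save(Connection db, String field, RoaringBitmap blocks) throws Exception {
            blocks.runOptimize();                       // use run-length containers where they help
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            blocks.serialize(new DataOutputStream(bos));
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT OR REPLACE INTO field_index(field, bitmap) VALUES (?, ?)")) {
                ps.setString(1, field);
                ps.setBytes(2, bos.toByteArray());
                ps.executeUpdate();
            }
        }

        public static RoaringBitmap load(Connection db, String field) throws Exception {
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT bitmap FROM field_index WHERE field = ?")) {
                ps.setString(1, field);
                try (ResultSet rs = ps.executeQuery()) {
                    RoaringBitmap rb = new RoaringBitmap();
                    if (rs.next()) {
                        rb.deserialize(new DataInputStream(new ByteArrayInputStream(rs.getBytes(1))));
                    }
                    return rb;
                }
            }
        }

        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:packet_index.db")) {
                try (Statement st = db.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS field_index(field TEXT PRIMARY KEY, bitmap BLOB)");
                }
                RoaringBitmap blocksWithDns = RoaringBitmap.bitmapOf(3, 17, 42, 1001);
                save(db, "dns.query", blocksWithDns);
                System.out.println("blocks containing dns.query: " + load(db, "dns.query"));
            }
        }
    }

Queries like "which blocks contain both field A and field B" then become cheap bitmap ANDs, and only the matching blocks ever get re-read from disk.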
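For the running average/variance, Welford's one-pass algorithm is the usual trick: a few numbers per series, updated as samples stream by, and the data never has to fit in memory. A minimal sketch:

    // Welford's one-pass mean/variance: O(1) memory, numerically stable,
    // no need to keep the samples around.
    public class RunningStats {
        private long n = 0;
        private double mean = 0.0;
        private double m2 = 0.0;    // sum of squared deviations from the running mean

        public void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);   // deliberately uses the updated mean
        }

        public double mean()     { return mean; }
        public double variance() { return n > 1 ? m2 / (n - 1) : 0.0; }   // sample variance
    }

Compared to the naive sum / sum-of-squares approach it doesn't fall apart when the variance is small relative to the mean, and it fits the streaming style above: keep one RunningStats per field and update it as blocks go by.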