This is a bad and needlessly complicated solution because the authors don't know anything about ML.<p>It's both incredibly slow and it will perform poorly (in terms of final accuracy and the number of training steps needed), because you can't just software-engineer your way through ML.<p>The important insight is that you do not need all patches from every image! In fact, showing patches in sequence like this is extremely detrimental to training: the network sees far too much data that is too similar, in too-big chunks. You want random patches from random images, and the more mixed up the patches, the better.<p>Knowing this, when you look at their latency equation there's a different and obvious solution: split the loading process into two steps that run in parallel.<p>The first step constantly downloads new images from the web and swaps old images out.
The second step picks an image at random from the ones that are available and generates a random patch from it.<p>The first step is network bound; the second step is CPU bound. The second step always has images available: it never waits for the first, it just picks another random image and another random patch. You get great resource utilization out of this (a rough sketch is below).<p>That's it. No other changes needed. Just use an off-the-shelf fast image loader. No need for a cluster.<p>This is a huge waste of engineering time and ongoing compute for what would have been a simple ML problem, had anyone with ML knowledge been around.<p>Hey, tweag! If you want to do ML, reach out. :) You can do far better than this!
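For concreteness, a minimal sketch of that two-step loader, assuming plain `requests` + Pillow and a list of image URLs. The class and parameter names are made up for illustration, and it assumes every image is larger than the patch size and that the pool has been warmed up before you start sampling:

    import io
    import random
    import threading

    import requests
    from PIL import Image

    class PatchPool:
        def __init__(self, urls, pool_size=256, patch=224):
            self.urls = urls
            self.pool_size = pool_size
            self.patch = patch
            self.images = []            # decoded images currently held in memory
            self.lock = threading.Lock()
            # Step 1: a background thread that is purely network bound.
            threading.Thread(target=self._download_loop, daemon=True).start()

        def _download_loop(self):
            # Constantly download new images and swap old ones out of the pool.
            while True:
                url = random.choice(self.urls)
                data = requests.get(url).content
                img = Image.open(io.BytesIO(data)).convert("RGB")
                with self.lock:
                    if len(self.images) < self.pool_size:
                        self.images.append(img)
                    else:
                        self.images[random.randrange(self.pool_size)] = img

        def random_patch(self):
            # Step 2: CPU bound. Never waits on the network -- just picks a
            # random resident image and crops a random patch out of it.
            with self.lock:
                img = random.choice(self.images)
            x = random.randint(0, img.width - self.patch)
            y = random.randint(0, img.height - self.patch)
            return img.crop((x, y, x + self.patch, y + self.patch))

Wrap `random_patch()` in whatever Dataset abstraction your training framework expects; the point is simply that the downloader and the patch sampler never block each other.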
> Though it has Python bindings, OpenSlide is implemented in C and reads files using standard OS file handlers, however our data sits on cloud storage that is accessible via HTTP. This means that, to open a WSI file, one needs to first download the entire file to disk, and only then can they load it with OpenSlide. But then, what if we need to read tens of thousands of WSIs, a few gigabytes each? This can total more than what a single disk can contain. Besides, even if we mounted multiple disks, the cost and time it would take to transfer all this data on every new machine would be too much. In addition to that, most of the time only a fraction of the entire WSI is of interest, so downloading the entire data is inefficient. A solution is to read the bytes that we need when we need them directly from Blob Storage. fsspec is a Python package that allows us to define “abstract” filesystems, with a custom implementation to list, read and write files. One such implementation, adlfs, works for Azure Blob Storage.<p>AWS S3 has byte-range fetches specifically for this use case [1]. This is quite handy for data lakes and OLAP databases. Apparently, 8 MB and 16 MB are good sizes for typical workloads [2].<p>[1] <a href="https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html" rel="nofollow noreferrer">https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing...</a><p>[2] <a href="https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.pdf" rel="nofollow noreferrer">https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.p...</a>
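For illustration, a byte-range read from S3 with boto3 might look like the sketch below; the bucket and key are placeholders, and the 8 MB chunk size just follows the sizing guidance in [2]:

    import boto3

    s3 = boto3.client("s3")
    CHUNK = 8 * 1024 * 1024  # 8 MB, per the sizing guidance in [2]

    def read_range(bucket, key, offset, length=CHUNK):
        # Fetch only the requested byte range instead of the whole object.
        end = offset + length - 1
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={offset}-{end}")
        return resp["Body"].read()

    # e.g. read the first 8 MB of a multi-gigabyte slide without downloading it all
    header = read_range("my-wsi-bucket", "slides/slide_0001.svs", 0)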
Because of course there's so little to worry about with storing vast reams of medical data from real people in cloud systems (that surely never get breached) to be accessed by AIs that surely will never create data privacy problems from all the ML vacuuming they rely on....
> But as it turns out, we can’t use it.<p>> Although it has Python bindings, OpenSlide is implemented in C and reads files using standard OS file handlers, however our data sits on cloud storage that is accessible via HTTP.<p>This is a self-inflicted problem. Very typical: people who don't know how storage works or what functionality is available will often push themselves into an imaginary corner.<p>Why, of all things, use HTTP for this?<p>No, of course you don't need to download the whole file to read it (a quick illustration is below).<p>"standard OS file handlers" -- this is a strong indicator that the person writing this doesn't understand how their OS works. What standard are we talking about? Even if "standard" here is used to mean "some common way" -- then which one? How are the files opened? And so on. The author didn't research the subject at all, and came up with an awful solution (vendor lock-in) as a result.
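To illustrate the "no need to download the whole file" part: plain HTTP already supports partial reads via Range requests, which is essentially what fsspec-style filesystems do under the hood. A toy example with `requests` against a placeholder URL, assuming the server advertises `Accept-Ranges: bytes` (blob stores generally do):

    import requests

    url = "https://example.com/slides/slide_0001.svs"  # placeholder URL
    # Ask for only the first 64 KiB of the object.
    resp = requests.get(url, headers={"Range": "bytes=0-65535"})
    # 206 Partial Content means the server honoured the range request.
    assert resp.status_code == 206
    first_chunk = resp.content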
What does 'at scale' mean here, and why would anyone need 'the cloud'? Medical images aren't like cell phone videos, where everyone is creating data all the time. There is only so much medical data being created, because the machines that create it are limited in number.