I'm looking for a tool that surely exists but I can't seem to find it. I have thousands of small video files, and I'd like to process each one independently across a cluster. The processing can definitely benefit from a GPU for some parts of the workload (e.g. object detection); it's all PyTorch code. Ideally, I'd like to use cloud computing for this (EKS would be preferable, but I can also spin up machines and set up Slurm, etc. if need be). I've looked into GNU Parallel, Dask and PySpark. I'm a bit frustrated that most examples assume you have numpy arrays/CSVs at the start... surely there exists a tool that deals with this at the "files" and "scripts" level? The closest match I've seen is Hadoop, but again, I want one where I can chain scripts together.
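To be concrete, the per-file work has roughly this shape (just a placeholder sketch; `process_video` and the paths are made up, the model/detection steps are elided). I want something that runs this function, or an equivalent script, once per file across many GPU machines:

```python
# Placeholder for the per-file work I want to fan out; names are illustrative.
from pathlib import Path

import torch


def process_video(video_path: Path, out_dir: Path) -> Path:
    """Run the PyTorch pipeline (decode frames, object detection) on one video."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model = ...           # load the detector onto `device`
    # detections = ...      # run detection over the decoded frames
    out_path = out_dir / (video_path.stem + ".json")
    # out_path.write_text(...)  # write detections
    return out_path
```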
My immediate reaction was "why not Parallel?" but I realized I'm not immediately sure how to deal with multiple servers... Off the top of my head: have Parallel manage threads that execute remote scripts over SSH, and use scp to copy files to the servers as needed. You'd have to figure out how to manage utilization of the servers so you don't overload them, and things like that.

Using Dask is also possible. I have a Dask-based system I built that farms out jobs/objects to members of a simple cluster I set up by hand. The Python class is responsible for managing an external process (such as the Java-based Stanford NLP toolkit) or a Python script that uses spaCy. Each job gets sent a block of text, which is then turned into features by whatever tool is being used. The class uses Python's 'subprocess' library to deal with the external processes. Dead simple multiprocessing on a cluster w/o the complexities of Slurm. Setting up the venv on the servers in the cluster is the main hassle, but Dask works fine. Send me a PM if you want a copy of this class to get an idea of how it works.
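For the video case, the same pattern would look roughly like this. It's only a sketch: it assumes you already have a Dask scheduler and workers running, that every worker can see the videos (shared storage or pre-copied), and that a hypothetical `process_video.py` script does the actual PyTorch work on one file. Adjust names and paths to your setup.

```python
# Farm per-file jobs out over Dask, one subprocess call per video.
# Assumptions: scheduler at the address below, workers share /shared/videos,
# and process_video.py (hypothetical) handles a single file.
import subprocess
from pathlib import Path

from dask.distributed import Client, as_completed


def run_one(video_path: str) -> str:
    # Each Dask task just shells out to the per-file script on the worker.
    subprocess.run(
        ["python", "process_video.py", video_path],
        check=True,  # raise so Dask marks the task as errored if the script fails
    )
    return video_path


if __name__ == "__main__":
    client = Client("tcp://your-scheduler:8786")
    videos = [str(p) for p in Path("/shared/videos").glob("*.mp4")]
    futures = client.map(run_one, videos)  # one task per file
    for fut in as_completed(futures):
        print("done:", fut.result())
    # If you start workers with --resources "GPU=1", you can pass
    # resources={"GPU": 1} to client.map to keep it to one job per GPU.
```

The nice part is that the per-file work stays a plain script, so swapping the detector or chaining another processing step doesn't touch the cluster plumbing.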