Though probably deprecated in favor of more database/DBMS-like tools such as DuckDB, the Arrow Plasma store holds handles to objects in a separate process:<p><pre><code> $ plasma_store -m 1000000000 -s /tmp/plasma
</code></pre>
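A minimal client sketch against the (since removed) pyarrow.plasma API, for the curious; the array contents here are just an example:<p><pre><code>import numpy as np
import pyarrow.plasma as plasma  # removed in newer pyarrow releases

# Connect to the store started above; put() serializes the object into
# the store's shared memory and returns an ObjectID handle.
client = plasma.connect("/tmp/plasma")
object_id = client.put(np.arange(10))

# Any process connected to the same store can fetch it by ID.
print(client.get(object_id))
</code></pre>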
Arrow arrays are like NumPy arrays, but they're designed for zero-copy use, e.g. IPC (interprocess communication). There's a dtype_backend kwarg on the pandas read_* functions (and DataFrame.convert_dtypes):<p>df = pandas.read_csv("data.csv", dtype_backend="pyarrow")<p>The Plasma In-Memory Object Store > Using Arrow and Pandas with Plasma > Storing Arrow Objects in Plasma
<a href="https://arrow.apache.org/docs/dev/python/plasma.html#storing-arrow-objects-in-plasma" rel="nofollow noreferrer">https://arrow.apache.org/docs/dev/python/plasma.html#storing...</a><p>Streaming, Serialization, and IPC >
<a href="https://arrow.apache.org/docs/python/ipc.html" rel="nofollow noreferrer">https://arrow.apache.org/docs/python/ipc.html</a><p>"DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB" (2021)
<a href="https://duckdb.org/2021/12/03/duck-arrow.html" rel="nofollow noreferrer">https://duckdb.org/2021/12/03/duck-arrow.html</a>
Good content, but over-optimized for SEO. Would be nice to hear about the actual efficiency of these methods.<p>For instance, does fork() copy the pages of memory containing the array? I believe it's copy-on-write semantics, right? What happens when the parent process changes the array?<p>Then, how do Pipe and Queue send the array across processes? Do they pickle and unpickle it? Use shared memory?
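The fork() question can be answered empirically (Unix-only sketch): after fork() the pages are shared copy-on-write, so the parent's write below forces the kernel to copy the page for the parent, and the child keeps seeing the original zeros:<p><pre><code>import os
import time
import numpy as np

arr = np.zeros(4)

pid = os.fork()
if pid == 0:
    # Child: wait for the parent to write, then inspect its own view.
    time.sleep(0.5)
    print("child sees:", arr)   # still zeros: the write stayed in the parent
    os._exit(0)
else:
    arr[:] = 1.0                # triggers copy-on-write in the parent
    os.waitpid(pid, 0)
    print("parent sees:", arr)  # [1. 1. 1. 1.]
</code></pre><p>As for Pipe and Queue: both pickle the object and push the bytes through an OS pipe; they don't use shared memory (that's what multiprocessing.shared_memory is for).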
I was searching for a similar article. I'm working on an AutoML Python package where I use different packages to train ML models on tabular data. Very often the memory is not properly released by the external packages, so the only way to manage memory is to execute training in separate processes.
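A minimal sketch of that pattern (train_one_model and the configs are placeholders): each model trains in a fresh process, so whatever an external library leaks is returned to the OS when the process exits:<p><pre><code>import multiprocessing as mp

def train_one_model(config, queue):
    # Placeholder for real training; any memory leaked by libraries here
    # is reclaimed by the OS when this process exits.
    score = 1.0 / config["lr"]
    queue.put(score)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # fresh interpreter per model
    queue = ctx.Queue()
    for config in ({"lr": 0.1}, {"lr": 0.01}):
        p = ctx.Process(target=train_one_model, args=(config, queue))
        p.start()
        score = queue.get()         # read before join to avoid pipe deadlock
        p.join()
        print(config, "->", score)
</code></pre>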
Actually ran into this problem this week, toyed around with multiprocessing.shared_memory (which seems to also rely on mmap'ed files, right?) and decided to just embrace the GIL.<p>Multiprocessing is not needed when your handful of workers are just calling NumPy code, which releases the GIL anyway.<p>Also, some/most NumPy functions are multithreaded (depending on the BLAS implementation linked against); take advantage of that by scheduling huge operations and just letting the interpreter sit idle waiting for the result.
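A minimal sketch of that approach, assuming the workload is dominated by large NumPy operations: plain threads parallelize fine because NumPy drops the GIL inside them (and BLAS may add its own threads on top):<p><pre><code>from concurrent.futures import ThreadPoolExecutor
import numpy as np

def heavy(seed):
    rng = np.random.default_rng(seed)
    a = rng.random((2000, 2000))
    # The matmul runs with the GIL released (plus possible BLAS threads),
    # so the four calls below genuinely overlap.
    return float(np.linalg.norm(a @ a))

with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(heavy, range(4))))
</code></pre>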
There is also Apache Arrow, which covers a similar use case. Maybe this is worth considering: <a href="https://arrow.apache.org/docs/python/memory.html#on-disk-and-memory-mapped-files" rel="nofollow noreferrer">https://arrow.apache.org/docs/python/memory.html#on-disk-and...</a>
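A minimal sketch of the memory-mapped approach from those docs (the file path and data are made up): write a record batch to an Arrow IPC file, then map it back so reads reference the mapped buffers instead of copying:<p><pre><code>import pyarrow as pa

batch = pa.record_batch([pa.array(range(10))], names=["x"])

# Write an Arrow IPC file to disk.
with pa.OSFile("/tmp/example.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, batch.schema) as writer:
        writer.write_batch(batch)

# Memory-map it back; the table's buffers point into the mapping.
with pa.memory_map("/tmp/example.arrow", "r") as source:
    table = pa.ipc.open_file(source).read_all()
    print(table)
</code></pre>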