Though probably deprecated in favor of more database/DBMS-like tools such as DuckDB, the Arrow Plasma store holds handles to objects in a separate process:<p><pre><code> $ plasma_store -m 1000000000 -s /tmp/plasma
</code></pre>
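A minimal client sketch against the (since removed) pyarrow.plasma API, for the curious; the array contents here are just an example:<p><pre><code>import numpy as np
import pyarrow.plasma as plasma  # removed in newer pyarrow releases

# Connect to the store started above; put() serializes the object into
# the store's shared memory and returns an ObjectID handle.
client = plasma.connect("/tmp/plasma")
object_id = client.put(np.arange(10))

# Any process connected to the same store can fetch it by ID.
print(client.get(object_id))
</code></pre>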
Arrow arrays are like NumPy arrays, but they're designed for zero-copy use, e.g. IPC (interprocess communication). There's a dtype_backend kwarg on the pandas read_* functions (and DataFrame.convert_dtypes):<p>df = pandas.read_csv("data.csv", dtype_backend="pyarrow")<p>The Plasma In-Memory Object Store > Using Arrow and Pandas with Plasma > Storing Arrow Objects in Plasma
<a href="https://arrow.apache.org/docs/dev/python/plasma.html#storing-arrow-objects-in-plasma" rel="nofollow noreferrer">https://arrow.apache.org/docs/dev/python/plasma.html#storing...</a><p>Streaming, Serialization, and IPC >
<a href="https://arrow.apache.org/docs/python/ipc.html" rel="nofollow noreferrer">https://arrow.apache.org/docs/python/ipc.html</a><p>"DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB" (2021)
<a href="https://duckdb.org/2021/12/03/duck-arrow.html" rel="nofollow noreferrer">https://duckdb.org/2021/12/03/duck-arrow.html</a>
Good content, but over-optimized for SEO. Would be nice to hear about the actual efficiency of these methods.<p>For instance, does fork() copy the pages of memory containing the array? I believe it's copy-on-write semantics, right? What happens when the parent process changes the array?<p>Then, how do Pipe and Queue send the array across processes? Do they pickle and unpickle it? Use shared memory?
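The fork() question can be answered empirically (Unix-only sketch): after fork() the pages are shared copy-on-write, so the parent's write below forces the kernel to copy the page for the parent, and the child keeps seeing the original zeros:<p><pre><code>import os
import time
import numpy as np

arr = np.zeros(4)

pid = os.fork()
if pid == 0:
    # Child: wait for the parent to write, then inspect its own view.
    time.sleep(0.5)
    print("child sees:", arr)   # still zeros: the write stayed in the parent
    os._exit(0)
else:
    arr[:] = 1.0                # triggers copy-on-write in the parent
    os.waitpid(pid, 0)
    print("parent sees:", arr)  # [1. 1. 1. 1.]
</code></pre><p>As for Pipe and Queue: both pickle the object and push the bytes through an OS pipe; they don't use shared memory (that's what multiprocessing.shared_memory is for).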
I was searching for a similar article. I'm working on an AutoML Python package where I use different packages to train ML models on tabular data. Very often the memory is not properly released by the external packages, so the only way to manage memory is to execute training in separate processes.
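A minimal sketch of that pattern (train_one_model and the configs are placeholders): each model trains in a fresh process, so whatever an external library leaks is returned to the OS when the process exits:<p><pre><code>import multiprocessing as mp

def train_one_model(config, queue):
    # Placeholder for real training; any memory leaked by libraries here
    # is reclaimed by the OS when this process exits.
    score = 1.0 / config["lr"]
    queue.put(score)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # fresh interpreter per model
    queue = ctx.Queue()
    for config in ({"lr": 0.1}, {"lr": 0.01}):
        p = ctx.Process(target=train_one_model, args=(config, queue))
        p.start()
        score = queue.get()         # read before join to avoid pipe deadlock
        p.join()
        print(config, "->", score)
</code></pre>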
Actually ran into this problem this week, toyed around with multiprocessing.shared_memory (which seems to also rely on mmap'ed files, right?) and decided to just embrace the GIL.<p>Multiprocessing is not needed when your handful of workers are just calling NumPy code, which releases the GIL anyway.<p>Also, some/most NumPy functions are multithreaded (depending on the BLAS implementation linked against); take advantage of that by scheduling huge operations and just letting the interpreter sit idle waiting for the result.
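A minimal sketch of that approach, assuming the workload is dominated by large NumPy operations: plain threads parallelize fine because NumPy drops the GIL inside them (and BLAS may add its own threads on top):<p><pre><code>from concurrent.futures import ThreadPoolExecutor
import numpy as np

def heavy(seed):
    rng = np.random.default_rng(seed)
    a = rng.random((2000, 2000))
    # The matmul runs with the GIL released (plus possible BLAS threads),
    # so the four calls below genuinely overlap.
    return float(np.linalg.norm(a @ a))

with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(heavy, range(4))))
</code></pre>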
There is also Apache Arrow, which covers a similar use case. Maybe this is worth considering: <a href="https://arrow.apache.org/docs/python/memory.html#on-disk-and-memory-mapped-files" rel="nofollow noreferrer">https://arrow.apache.org/docs/python/memory.html#on-disk-and...</a>
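A minimal sketch of the memory-mapped approach from those docs (the file path and data are made up): write a record batch to an Arrow IPC file, then map it back so reads reference the mapped buffers instead of copying:<p><pre><code>import pyarrow as pa

batch = pa.record_batch([pa.array(range(10))], names=["x"])

# Write an Arrow IPC file to disk.
with pa.OSFile("/tmp/example.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, batch.schema) as writer:
        writer.write_batch(batch)

# Memory-map it back; the table's buffers point into the mapping.
with pa.memory_map("/tmp/example.arrow", "r") as source:
    table = pa.ipc.open_file(source).read_all()
    print(table)
</code></pre>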