How to share a NumPy array between processes

67 points by jasonb05 over 1 year ago

5 comments

westurner over 1 year ago
Though deprecated, probably in favor of more database/DBMS-like options such as DuckDB, the Arrow Plasma store runs as a separate process that holds handles to shared objects:

    $ plasma_store -m 1000000000 -s /tmp/plasma

Arrow arrays are like NumPy arrays, but they're designed for zero copy, e.g. IPC (interprocess communication). There's a dtype_backend kwarg on the Pandas read_* methods:

    df = pandas.read_csv("data.csv", dtype_backend="pyarrow")

The Plasma In-Memory Object Store > Using Arrow and Pandas with Plasma > Storing Arrow Objects in Plasma: https://arrow.apache.org/docs/dev/python/plasma.html#storing-arrow-objects-in-plasma

Streaming, Serialization, and IPC: https://arrow.apache.org/docs/python/ipc.html

"DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB" (2021): https://duckdb.org/2021/12/03/duck-arrow.html
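
For a sense of the zero-copy Arrow route without the now-deprecated Plasma store, here is a minimal sketch using pyarrow's IPC file format plus a memory-mapped read; the path /tmp/data.arrow and the column name are illustrative assumptions:

    import pyarrow as pa

    table = pa.table({"x": pa.array(range(1_000_000))})

    # Writer process: persist the table in the Arrow IPC file format.
    with pa.OSFile("/tmp/data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Reader process: memory-map the file; the record batches reference
    # the mapped pages directly, so nothing is copied or deserialized.
    with pa.memory_map("/tmp/data.arrow", "r") as source:
        loaded = pa.ipc.open_file(source).read_all()
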
arjvik over 1 year ago
Good content, but over-optimized for SEO. It would be nice to hear about the actual efficiency of these methods.

For instance, does fork() copy the page of memory containing the array? I believe it's copy-on-write semantics, right? What happens when the parent process changes the array?

Then, how do Pipe and Queue send the array across processes? Do they also pickle and unpickle it? Use shared memory?
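
For what it's worth: Pipe and Queue do pickle and unpickle whatever passes through them, so the array payload gets copied. multiprocessing.shared_memory (Python 3.8+) is the stdlib way around that; here is a minimal sketch, with shapes and names chosen for illustration:

    import numpy as np
    from multiprocessing import Process, shared_memory

    def worker(name, shape, dtype):
        shm = shared_memory.SharedMemory(name=name)
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        arr *= 2              # mutates the parent's array in place
        del arr               # drop the view before closing the mapping
        shm.close()

    if __name__ == "__main__":
        data = np.arange(10, dtype=np.float64)
        shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
        shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
        shared[:] = data
        p = Process(target=worker, args=(shm.name, data.shape, data.dtype))
        p.start()
        p.join()
        print(shared)         # doubled in place; only metadata was pickled
        del shared
        shm.close()
        shm.unlink()
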
pplonski86 over 1 year ago
I was searching for a similar article. I'm working on an AutoML Python package where I use different packages to train ML models on tabular data. Very often the memory is not properly released by the external packages, so the only way to manage memory is to execute training in separate processes.
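
One pattern that helps there: run each training job in a short-lived child process so the OS reclaims everything when it exits. A sketch with the stdlib, where train() is a hypothetical stand-in for the real training call:

    import multiprocessing as mp

    def train(config, queue):
        # Hypothetical trainer: whatever it (or the libraries it pulls in)
        # allocates is returned to the OS when this process exits.
        score = 0.42 * config["lr"]      # placeholder for real training
        queue.put(score)

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")    # fresh interpreter, no inherited state
        queue = ctx.Queue()
        p = ctx.Process(target=train, args=({"lr": 0.1}, queue))
        p.start()
        score = queue.get()              # read the result before joining
        p.join()
        print(score)
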
KeplerBoy over 1 year ago
I actually ran into this problem this week, toyed around with multiprocessing.shared_memory (which seems to also rely on mmapped files, right?), and decided to just embrace the GIL.

Multiprocessing isn't needed when your handful of worker threads are just calling NumPy code that releases the GIL anyway.

Also, some/most NumPy functions are multithreaded (depending on the BLAS implementation they're linked against); take advantage of that, schedule huge operations, and just let the interpreter sit idle waiting for the result.
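
Concretely, that thread-based version can look like the sketch below (matrix sizes and thread count are arbitrary). Because NumPy drops the GIL inside the BLAS call, the threads really do run in parallel:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def square(m):
        # The matmul releases the GIL while BLAS runs, so threads overlap.
        return m @ m

    if __name__ == "__main__":
        mats = [np.random.rand(2000, 2000) for _ in range(4)]
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(square, mats))
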
schoetbi over 1 year ago
There is also Apache Arrow, which addresses a similar use case. Maybe this is worth considering: https://arrow.apache.org/docs/python/memory.html#on-disk-and-memory-mapped-files
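
As a rough sketch of that memory-mapped route (file path and dtype are assumptions): one process writes the raw bytes once, and any other process can map the file and wrap it in a NumPy view without copying.

    import numpy as np
    import pyarrow as pa

    # Producer: write the raw array bytes to a file once.
    data = np.arange(1_000_000, dtype=np.float64)
    with pa.OSFile("/tmp/shared.buf", "wb") as f:
        f.write(data.tobytes())

    # Consumer (any process): memory-map the file, view it as an array.
    with pa.memory_map("/tmp/shared.buf", "r") as source:
        buf = source.read_buffer()                   # zero-copy pyarrow Buffer
        view = np.frombuffer(buf, dtype=np.float64)  # read-only NumPy view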