Yes, people working on graph-based ML quickly realize that the underlying data structures used by most of the originally academic libraries (NetworkX, PyG, etc.) are bad.

I wrote about this before [1] and based a node embedding library around the concept [2].

NetworkX-style graphs are laid out as a bunch of objects on the heap with pointers to each other. That works at extreme scales, where the graph is spread across a cluster's RAM and you're already paying the latency cost of fetch operations anyway. But it makes little sense for graphs with < 5B nodes, to be honest.

Here's the dream (it remains to be implemented by someone):

Laying the graph out as a CSR sparse matrix makes way more sense because of data locality. You have a flat array of neighbor IDs (one entry per edge), an index-pointer array (one entry per node) marking where each node's neighbors start, then one matrix with a row per edge for edge data and one matrix with a row per node for node data. Ideally you back the whole thing with Apache Arrow memory to ease access from other libraries and languages (see the sketch after the links below).

At larger scales, you could just leave the CSR arrays on NVMe drives, and you'd still get around 500 MB/s of random query throughput with hand-coded access, or ~150 MB/s with mmap.

[1] https://www.singlelunch.com/2019/08/01/700x-faster-node2vec-models-fastest-random-walks-on-a-graph/

[2] https://github.com/VHRanger/nodevectors
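
To make the layout concrete, here's a minimal sketch of the CSR idea in plain NumPy (the indptr/indices names follow SciPy's CSR convention; the toy graph, the neighbors helper, and the indices.bin filename are my own illustrations, not any library's API):

    import numpy as np

    # Toy directed graph with 3 nodes and 4 edges:
    # 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
    indptr    = np.array([0, 2, 3, 4], dtype=np.int64)  # n_nodes + 1 entries
    indices   = np.array([1, 2, 2, 0], dtype=np.int32)  # n_edges destination IDs
    edge_data = np.array([0.5, 0.1, 0.9, 0.2])          # one row per edge
    node_data = np.zeros((3, 8), dtype=np.float32)      # one row per node

    def neighbors(node):
        # One lookup into indptr plus one contiguous slice: a node's
        # whole adjacency list sits in a single cache-friendly run of
        # memory, with no pointer chasing.
        start, end = indptr[node], indptr[node + 1]
        return indices[start:end], edge_data[start:end]

    print(neighbors(0))  # (array([1, 2], dtype=int32), array([0.5, 0.1]))

    # For graphs that don't fit in RAM, the same arrays could be
    # memory-mapped from NVMe and the access code above is unchanged:
    #   indices = np.memmap("indices.bin", dtype=np.int32, mode="r")

The same slicing works whether the arrays live in NumPy buffers, Arrow memory, or a memmap, which is the whole appeal of the layout.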