Yes, people working on graph-based ML quickly realize that the underlying data structures used by most of the originally academic libraries (NetworkX, PyG, etc.) are bad.

I wrote about this before [1] and based a node embedding library around the concept [2].

NetworkX-style graphs are laid out as a bunch of objects on the heap with pointers to each other. That works at extreme scales, where the graph is spread across a cluster's RAM and you're already paying the latency cost of fetch operations anyway. But it makes little sense for graphs with < 5B nodes, to be honest.

Here's the dream (it remains to be implemented by someone):

Laying the graph out as a CSR sparse matrix makes way more sense because of data locality. You have a flat array of neighbor IDs (one entry per edge), an index-pointer array (one entry per node) marking where each node's neighbors start, then one matrix with a row per edge for edge data and one matrix with a row per node for node data. Ideally you back the whole thing with Apache Arrow memory to ease access from other libraries and languages (see the sketch after the links below).

At larger scales, you could just leave the CSR arrays on NVMe drives, and you'd still get around 500 MB/s of random query throughput with hand-coded access, or ~150 MB/s with mmap.

[1] https://www.singlelunch.com/2019/08/01/700x-faster-node2vec-models-fastest-random-walks-on-a-graph/

[2] https://github.com/VHRanger/nodevectors
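
To make the layout concrete, here's a minimal sketch of the CSR idea in plain NumPy (the indptr/indices names follow SciPy's CSR convention; the toy graph, the neighbors helper, and the indices.bin filename are my own illustrations, not any library's API):

    import numpy as np

    # Toy directed graph with 3 nodes and 4 edges:
    # 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
    indptr    = np.array([0, 2, 3, 4], dtype=np.int64)  # n_nodes + 1 entries
    indices   = np.array([1, 2, 2, 0], dtype=np.int32)  # n_edges destination IDs
    edge_data = np.array([0.5, 0.1, 0.9, 0.2])          # one row per edge
    node_data = np.zeros((3, 8), dtype=np.float32)      # one row per node

    def neighbors(node):
        # One lookup into indptr plus one contiguous slice: a node's
        # whole adjacency list sits in a single cache-friendly run of
        # memory, with no pointer chasing.
        start, end = indptr[node], indptr[node + 1]
        return indices[start:end], edge_data[start:end]

    print(neighbors(0))  # (array([1, 2], dtype=int32), array([0.5, 0.1]))

    # For graphs that don't fit in RAM, the same arrays could be
    # memory-mapped from NVMe and the access code above is unchanged:
    #   indices = np.memmap("indices.bin", dtype=np.int32, mode="r")

The same slicing works whether the arrays live in NumPy buffers, Arrow memory, or a memmap, which is the whole appeal of the layout.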