I'd love to see more details. Ultimately, it seems like IBM has managed to build a generalized gather/scatter operation over large datasets for this particular task. Yes, this is an "old problem", but at the same time it's exactly the kind of engineering advancement that deserves discussion. Any engineer who cares about performance will want to know about memory optimization techniques.

As CPUs (and GPUs! And TPUs, and FPGAs, and whatever other accelerators come out) get faster and faster, the memory-layout problem becomes more and more important. CPUs, GPUs, and the rest keep pulling away from RAM, which simply isn't keeping up anymore.

A methodology for "properly" accessing memory sequentially has broad applicability at *every* level of the CPU or GPU cache hierarchy: from main memory to L3, L3 to L2, L2 to L1. The only place this "serialization" method won't apply is register space. (A toy sketch of the effect is below.)

The "machine learning" buzzword is getting annoying, IMO, but there's likely something very useful to talk about here. I for one am excited to see the full talk.
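To be clear, this is not IBM's method, just a minimal sketch (sizes and names made up) of why sequential access beats strided access: summing the same matrix row-by-row walks memory in cache-line order, while summing it column-by-column jumps by a full row stride on every load and thrashes the cache.

    #include <stdio.h>
    #include <time.h>

    #define N 4096

    /* Toy example only: 4096x4096 doubles (~128 MB), far bigger than any cache. */
    static double a[N][N];

    static double sum_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];      /* sequential: walks memory in storage order */
        return s;
    }

    static double sum_cols(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];      /* strided: jumps N*8 bytes per access */
        return s;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;

        clock_t t0 = clock();
        double s1 = sum_rows();
        clock_t t1 = clock();
        double s2 = sum_cols();
        clock_t t2 = clock();

        printf("row-major:    sum=%.0f  %.3fs\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-major: sum=%.0f  %.3fs\n", s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }

Same arithmetic, same data, usually a several-fold difference in runtime purely from access order; that's the gap a gather/scatter-into-sequential-layout approach is trying to close.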