We have been assigned the task of designing (and implementing) a framework for distributed optimization in the context of inverse modelling.

The target computing environments are GPU clusters (InfiniBand) as well as other distributed systems (high-latency and error-prone).

As for the optimization algorithms, it will support both derivative-free methods (e.g. VXQR1 or GA) as well as some variation of gradient descent.

As we would like the framework to be fault-tolerant, MPI is not an option(?).

What messaging system would be appropriate? I was thinking of 0mq, but I am getting mixed reactions from the experts.
Distributed optimisation is interesting.

Remember that you can tolerate node failure, as long as you have a copy of your state stored somewhere that you can restart from when required. The overall result isn't particularly affected by that.

0mq is probably what you want if you're implementing this the way I'm expecting you to, i.e. some variation on: take the current state, multicast it to all of the machines doing the optimisation; take the results, unicast them back to one machine to determine the best state to seed the next round; repeat (see the sketch below).

0mq over InfiniBand will, IIRC, use the native InfiniBand multicast groups, which is very efficient.

Keep snapshots of each iteration, and all should be good.

But like all things, it depends entirely on the specifics of what you're trying to do. Send me an email; you seem to have an interesting problem.
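For illustration only, here is a minimal sketch of that broadcast/collect loop on the coordinator side, using pyzmq. The port numbers, the pickled state format, the `score` field, and the worker count are all placeholders I've made up, not part of any existing framework; on a multicast-capable fabric the `tcp://` endpoints could be swapped for `epgm://` ones.

```python
# Hypothetical coordinator sketch: multicast state out, collect results back,
# pick the best candidate, snapshot, repeat. Not production code.
import pickle
import zmq

N_WORKERS = 8          # assumed number of worker nodes
N_ITERATIONS = 100     # assumed number of optimisation rounds

ctx = zmq.Context()

# PUB socket: broadcast the current state to every worker each round.
state_pub = ctx.socket(zmq.PUB)
state_pub.bind("tcp://*:5555")

# PULL socket: workers unicast their candidate results back here.
results_pull = ctx.socket(zmq.PULL)
results_pull.bind("tcp://*:5556")

# Placeholder initial state; lower score is assumed to be better.
state = {"params": [0.0] * 10, "score": float("inf")}

poller = zmq.Poller()
poller.register(results_pull, zmq.POLLIN)

for iteration in range(N_ITERATIONS):
    # Multicast the state that seeds this round.
    state_pub.send(pickle.dumps((iteration, state)))

    # Collect one candidate per worker; the timeout lets us tolerate
    # dead or slow nodes instead of blocking forever.
    candidates = []
    while len(candidates) < N_WORKERS:
        if not poller.poll(timeout=5000):   # milliseconds
            break                           # give up on missing workers
        candidates.append(pickle.loads(results_pull.recv()))

    # Keep the best candidate as the seed for the next round.
    best = min(candidates, key=lambda c: c["score"], default=state)
    if best["score"] < state["score"]:
        state = best

    # Snapshot every iteration so a restart can resume from here.
    with open(f"snapshot_{iteration:05d}.pkl", "wb") as f:
        pickle.dump(state, f)
```

A worker would do the mirror image: connect a SUB socket (with an empty subscription filter) to port 5555, evaluate candidates derived from each received state, and PUSH its best result to port 5556. Note the usual PUB/SUB slow-joiner caveat: workers that connect after the first broadcast simply wait for the next round, which is harmless here because the state is re-broadcast every iteration.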