We have been assigned the task of designing (and implementing) a framework for distributed optimization in the context of inverse modelling.

The target computing environments are GPU clusters (InfiniBand) as well as other distributed systems (high-latency and error-prone).

As for the optimization algorithms, it will support both derivative-free methods (e.g. VXQR1 or GA) as well as some variation of gradient descent.

As we would like the framework to be fault-tolerant, MPI is not an option(?).

What messaging system would be appropriate? I was thinking of 0mq, but I am getting mixed reactions from the experts.
Distributed optimisation is interesting.

Remember that you can tolerate node failure, as long as you have a copy of your state stored somewhere that you can restart from when required. The overall result isn't particularly affected by that.

0mq is probably what you want if you're implementing this the way I'm expecting you to, i.e. some variation on: take the current state, multicast it to all of the machines doing the optimisation; take the results, unicast them back to one machine to determine the best state to seed the next round; repeat (see the sketch below).

0mq over InfiniBand will, IIRC, use the native InfiniBand multicast groups, which is very efficient.

Keep snapshots of each iteration, and all should be good.

But like all things, it depends entirely on the specifics of what you're trying to do. Send me an email; you seem to have an interesting problem.
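For illustration only, here is a minimal sketch of that broadcast/collect loop on the coordinator side, using pyzmq. The port numbers, the pickled state format, the `score` field, and the worker count are all placeholders I've made up, not part of any existing framework; on a multicast-capable fabric the `tcp://` endpoints could be swapped for `epgm://` ones.

```python
# Hypothetical coordinator sketch: multicast state out, collect results back,
# pick the best candidate, snapshot, repeat. Not production code.
import pickle
import zmq

N_WORKERS = 8          # assumed number of worker nodes
N_ITERATIONS = 100     # assumed number of optimisation rounds

ctx = zmq.Context()

# PUB socket: broadcast the current state to every worker each round.
state_pub = ctx.socket(zmq.PUB)
state_pub.bind("tcp://*:5555")

# PULL socket: workers unicast their candidate results back here.
results_pull = ctx.socket(zmq.PULL)
results_pull.bind("tcp://*:5556")

# Placeholder initial state; lower score is assumed to be better.
state = {"params": [0.0] * 10, "score": float("inf")}

poller = zmq.Poller()
poller.register(results_pull, zmq.POLLIN)

for iteration in range(N_ITERATIONS):
    # Multicast the state that seeds this round.
    state_pub.send(pickle.dumps((iteration, state)))

    # Collect one candidate per worker; the timeout lets us tolerate
    # dead or slow nodes instead of blocking forever.
    candidates = []
    while len(candidates) < N_WORKERS:
        if not poller.poll(timeout=5000):   # milliseconds
            break                           # give up on missing workers
        candidates.append(pickle.loads(results_pull.recv()))

    # Keep the best candidate as the seed for the next round.
    best = min(candidates, key=lambda c: c["score"], default=state)
    if best["score"] < state["score"]:
        state = best

    # Snapshot every iteration so a restart can resume from here.
    with open(f"snapshot_{iteration:05d}.pkl", "wb") as f:
        pickle.dump(state, f)
```

A worker would do the mirror image: connect a SUB socket (with an empty subscription filter) to port 5555, evaluate candidates derived from each received state, and PUSH its best result to port 5556. Note the usual PUB/SUB slow-joiner caveat: workers that connect after the first broadcast simply wait for the next round, which is harmless here because the state is re-broadcast every iteration.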