Nick (the author) reflects on this:
"The data needs to be merged into a format which can be used to train a neural network. Solving this leads to the second, much bigger issue: many of the resulting label to image mappings are inappropriate...".<p>If one instead use a RNN (recurrent neural network) particularly with LSTM, then it can take as input the sequence of all photos from a business and the output of the model would then be the sequence of labels, similar to how translation models work.
Of course then another problem could perhaps be the ordering of labels is unrelated to order of photos for a business, but there is probably some way to handle this by either data synthesis (multiple permutations of the training data to ignore ordering of labels) or by sorting the labels in a certain fashion that the model can learn.