Can someone explain this like I'm 5? What are the use cases when it says it works on images, text, etc.? Why is this a big deal? What's the human input here, and what output should one expect?

From what I understand, human validation (supervision) is not happening while the algorithm is training on the data. Is that right? Will this be open to the public via standard ML frameworks, or proprietary?
Basically they cut out a part of the input and make the network predict the missing part (edit: they actually predict the average of all features). This works for images, audio, and text, and it produces high-quality feature representations that specialised networks can be built on. The two main tricks are:

1. Do the cutout in feature space, not the original input space (edit: the cutout is actually in input space).

2. The above would likely just collapse the features to 0, so they use the same network that does the reconstruction to produce the features (!). In their own words:

"We first encode a masked version of the training sample (model in *student mode*) and then construct training targets by encoding the unmasked version of the input sample with the same model but when parameterized as an exponentially moving average of the model weights (model in *teacher mode*)"
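A minimal PyTorch sketch of that student/teacher trick, in case it helps. The Encoder stand-in, masking scheme, and tau value are illustrative assumptions (the paper's actual model is a transformer and averages targets over the top K layers; a single output layer is used here for brevity):

    import copy
    import torch
    import torch.nn.functional as F

    class Encoder(torch.nn.Module):
        """Stand-in for the shared encoder; the real model is a transformer."""
        def __init__(self, dim=768):
            super().__init__()
            self.layer = torch.nn.Linear(dim, dim)

        def forward(self, x):
            return self.layer(x)

    student = Encoder()
    teacher = copy.deepcopy(student)  # same architecture, EMA of student weights
    for p in teacher.parameters():
        p.requires_grad = False       # teacher is never updated by the optimizer

    def ema_update(student, teacher, tau=0.999):
        # teacher <- tau * teacher + (1 - tau) * student
        with torch.no_grad():
            for ps, pt in zip(student.parameters(), teacher.parameters()):
                pt.mul_(tau).add_(ps, alpha=1 - tau)

    def train_step(x, mask, optimizer):
        # x: (batch, seq, dim); mask: (batch, seq, 1), 1 = visible, 0 = cut out.
        target = teacher(x).detach()   # teacher encodes the unmasked input
        pred = student(x * mask)       # student only sees the masked input
        # regress teacher features at the masked positions only
        loss = F.mse_loss(pred * (1 - mask), target * (1 - mask))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ema_update(student, teacher)
        return loss.item()

    x = torch.randn(8, 16, 768)
    mask = (torch.rand(8, 16, 1) > 0.15).float()  # drop ~15% of positions
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    train_step(x, mask, opt)

Because the targets come from an EMA copy of the same network rather than a fixed label set, the features can't trivially collapse the way a plain feature-space reconstruction would.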
It seems that they pass everything through an autoencoder first, and a different network tries to predict, from a partially masked input, the "correct" autoencoder latent-space representation of the unmasked input. If that works, the autoencoder's decoder can generate (guess) the unmasked data from the latent space.
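A toy sketch of the scheme as this comment describes it, assuming a pretrained frozen autoencoder; all module names and dimensions are made up for illustration (the paper itself uses an EMA teacher rather than a separate autoencoder):

    import torch
    import torch.nn.functional as F

    dim, latent = 784, 64
    ae_encoder = torch.nn.Linear(dim, latent)  # pretrained, frozen
    ae_decoder = torch.nn.Linear(latent, dim)  # decodes guesses at inference
    predictor = torch.nn.Linear(dim, latent)   # the network being trained

    for p in list(ae_encoder.parameters()) + list(ae_decoder.parameters()):
        p.requires_grad = False

    def step(x, mask, optimizer):
        # x: (batch, dim); mask: (batch, dim), 1 = visible, 0 = masked.
        target = ae_encoder(x).detach()  # latent code of the unmasked input
        pred = predictor(x * mask)       # predictor only sees the masked input
        loss = F.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # At inference, ae_decoder(predictor(masked_x)) "guesses" the full input.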
Is there progress on general structured/relational/graph data modalities?

In practice, you spend time and expertise reshaping the data into a form previously known to work.

FWIW, our datasets are huge, with a high data-to-noise ratio.