<i>The way BERT recognizes that it should pay attention to those words is basically by self-learning on a titanic game of Mad Libs. Google takes a corpus of English sentences and randomly removes 15 percent of the words, then BERT is set to the task of figuring out what those words ought to be. Over time, that kind of training turns out to be remarkably effective at making an NLP model “understand” context.</i><p>Great example of "explain it to your grandpa" ability.
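<p>For anyone curious what that "Mad Libs" setup looks like mechanically, here's a toy sketch in Python. The function name and word-level masking are just illustrative of the masked-language-modeling idea, not BERT's exact preprocessing (the real recipe works on subword tokens, and of the 15 percent it selects, it only replaces most with a mask symbol, swapping some for random words and leaving some unchanged so the model can't just key on the mask token):<p><pre><code>import random

MASK = "[MASK]"

def mask_sentence(tokens, rate=0.15):
    """Hide roughly 15 percent of tokens; return the masked
    sequence plus the hidden answers the model must recover."""
    masked, answers = [], {}
    for i, tok in enumerate(tokens):
        if random.random() &lt; rate:
            answers[i] = tok      # ground truth the model is graded against
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, answers

tokens = "the cat sat on the mat because it was tired".split()
masked, answers = mask_sentence(tokens)
print(" ".join(masked))  # e.g. "the cat sat on the [MASK] because it was tired"
print(answers)           # positions and words to predict; varies per run
</code></pre><p>Training then amounts to grading the model's guesses for the hidden positions against the answers, which is why filling in the blanks well forces it to pick up context from the surrounding words.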