For the code-minded out there, a "random variable" is something like a lazily evaluated value that can be "sampled", emitting a quantity (or a vector/tensor of them) each time. The OP article boils down to the fact that it's generally incorrect to assume a random variable can be represented solely by its unconditional probability distribution; a distribution is more of a visualization than a sufficient definition. Rather, one must track the entire graph of other random variables that feed into the current one (i.e. that the current one is conditional on), akin to how an Excel spreadsheet models all the dependencies of a cell.

The fun part comes when you can ask this computation graph: "which parameters for a random variable early in the chain would optimize some function of variables later in the chain?" And, handwaving a ton of nuance here, when those parameters are weights in a neural network, the function is a loss function on the training data, and the optimization is done by automatic differentiation (e.g. https://pytorch.org/tutorials/beginner/introyt/autogradyt_tutorial.html), you have modern AI.

If you're interested in the theoretical underpinnings here, Bishop's PRML is perhaps the classic starting point: https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
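
To make the "optimize an early parameter for a function of later variables" idea concrete, here's a minimal PyTorch sketch (all names and numbers are made up for illustration, not from the article): z is an upstream random variable whose mean mu we control, y is a downstream variable conditional on z, and autograd pushes gradients of a loss on y back through the sampling graph to mu.

    import torch

    torch.manual_seed(0)

    # Toy graph: z ~ Normal(mu, 1), y ~ Normal(2*z, 0.5).
    # mu is the "early" parameter; the loss is a function of the "later" variable y.
    mu = torch.zeros(1, requires_grad=True)   # parameter of the upstream random variable
    target = torch.tensor([3.0])              # where we want downstream samples to land
    opt = torch.optim.Adam([mu], lr=0.05)

    for step in range(2000):
        # rsample() uses the reparameterization trick, so sampling stays differentiable
        z = torch.distributions.Normal(mu, 1.0).rsample((256,))   # upstream RV
        y = torch.distributions.Normal(2.0 * z, 0.5).rsample()    # downstream RV, conditional on z
        loss = ((y - target) ** 2).mean()     # function of the later variable
        opt.zero_grad()
        loss.backward()                       # autograd walks the dependency graph back to mu
        opt.step()

    print(mu.item())  # converges near 1.5, since E[y] = 2*mu and the target is 3.0

This is (very roughly) the same machinery as training a neural network, just with one scalar parameter instead of millions of weights.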