I'm intrigued by the foundational data that powers large language models like GPT. Imagine this data, which captures a vast portion of human knowledge, as a random variable Z. Now, let's entertain the idea of another random variable, O, representing the entirety of universal knowledge, a realm largely beyond our current grasp. In principle, unlimited sampling from O could yield novel scientific discoveries.

When we sample from Z, we obtain strings that serve as training samples for the auto-regressive predictions of models like GPT. Let's further describe the training process of such a model as a pure function:

W_trained = f(D, W_init, R, S)

where:

- D is the dataset (samples drawn from Z),
- W denotes all the trainable parameters (W_init before training, W_trained after),
- R is the seed used for any randomness during training,
- and S, to simplify, encompasses any other required state, such as the optimizer's momentum.

The point is that f is deterministic: given the same inputs, it always produces the same output.

After training, we use y = g(x, W_trained) for inference. This function g, which is also used inside f, lets us run the model on another random variable, X. We can think of X as analogous to Z but closer to real-world test sets or actual user inputs. Y then denotes the outputs of repeated invocations of g, with each x sampled from X.

Building on this framework, we can view the whole pipeline as a Markov chain: O -> Z -> W_trained -> Y. Implicit in this chain is the assumption that X introduces no additional information into the system.

From the data processing inequality (DPI) [1] and the definition of mutual information, we can infer:

I(O;Y) ≤ I(Z;Y) ≤ I(W_trained;Y)

Interpreting this inequality: the mutual information between 'the entirety of universal knowledge' (O) and the output of our trained model (Y) can never exceed the mutual information between the dataset (Z) and the output (Y), which in turn can never exceed the mutual information between the trained weights (W_trained) and the output (Y).

In simpler terms, whatever the trained LLM generates (Y) can carry no more information about the encompassing universal knowledge (O) than its trained weights carry about that output, and no more than the training data carries about it.

So, now my question: how can an LLM generate novel insights? The same chain also gives I(O;Y) ≤ I(O;Z), so the model's outputs cannot tell us more about universal knowledge than the training data already does. Considering the data processing inequality, it seems the model shouldn't be able to produce knowledge beyond what's contained in dataset Z, unless the relevant part of universal knowledge is already present in the training data.

[1] https://en.wikipedia.org/wiki/Data_processing_inequality
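
To make the f / g framing above concrete, here is a minimal, hypothetical sketch in Python. The toy linear model, the SGD-with-momentum loop, and the names f and g simply mirror the notation in the post; this is not actual GPT training code. The only point it demonstrates is the determinism claim: fixing D, W_init, R, and S fixes W_trained, and fixing x and W_trained fixes y.

```python
# A minimal, hypothetical sketch of the f / g framing above, using a toy linear
# model trained with SGD + momentum. None of this is actual GPT code.
import numpy as np

def f(D, W_init, R, S=None):
    """W_trained = f(D, W_init, R, S): deterministic given all four inputs."""
    rng = np.random.default_rng(R)                    # R: the only source of randomness
    W = W_init.copy()
    momentum = S.copy() if S is not None else np.zeros_like(W)  # S: optimizer state
    for _ in range(1000):
        x, y = D[rng.integers(len(D))]                # draw a training pair from the dataset
        grad = 2.0 * (W @ x - y) * x                  # squared-error gradient of a linear model
        momentum = 0.9 * momentum + grad
        W -= 0.01 * momentum
    return W

def g(x, W_trained):
    """y = g(x, W_trained): inference is a pure function of input and weights."""
    return W_trained @ x

# Same D, W_init, R, S  ->  same W_trained  ->  same outputs Y for the same draws from X.
D = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
W0 = np.zeros(2)
assert np.allclose(f(D, W0, R=0), f(D, W0, R=0))
print(g(np.array([1.0, 1.0]), f(D, W0, R=0)))
```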
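
And as a quick numerical sanity check of the DPI direction used above, here is a toy Markov chain O -> Z -> Y over small finite alphabets (the probability tables are arbitrary, chosen only for illustration, and training plus inference are lumped into a single channel P(Y|Z)). Computing the mutual informations directly shows I(O;Y) ≤ I(Z;Y) and I(O;Y) ≤ I(O;Z).

```python
# Toy check of the data processing inequality on a discrete chain O -> Z -> Y.
# The distributions are made up purely for illustration.
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint distribution table p_xy[x, y]."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])))

p_o = np.array([0.5, 0.3, 0.2])                      # P(O)
p_z_given_o = np.array([[0.7, 0.2, 0.1],             # P(Z|O): a noisy "sampling" channel
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])
p_y_given_z = np.array([[0.9, 0.1],                  # P(Y|Z): training + inference lumped together
                        [0.5, 0.5],
                        [0.1, 0.9]])

p_oz = p_o[:, None] * p_z_given_o                    # joint P(O,Z)
p_zy = p_oz.sum(axis=0)[:, None] * p_y_given_z       # joint P(Z,Y)
p_oy = p_oz @ p_y_given_z                            # joint P(O,Y), using the Markov property

print(mutual_information(p_oy), "<=", mutual_information(p_zy))  # I(O;Y) <= I(Z;Y)
print(mutual_information(p_oy), "<=", mutual_information(p_oz))  # I(O;Y) <= I(O;Z)
```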