I'm intrigued by the foundational data that powers large language models like GPT. Imagine this data, which captures a vast portion of human knowledge, as a random variable Z. Now, let's entertain the idea of another random variable, O, representing the entirety of universal knowledge, a realm largely beyond our current grasp. In principle, unlimited sampling from O could yield novel scientific discoveries.

When we sample from Z, we obtain strings that serve as training samples for the auto-regressive predictions of models like GPT. Let's further describe the training process of such a model as a pure function:

W_trained = f(D, W_init, R, S)

where:

- D is the dataset (samples drawn from Z),
- W denotes all the trainable parameters (W_init before training, W_trained after),
- R is the seed used for any randomness during training,
- and S, to simplify, encompasses any other required state, such as the optimizer's momentum.

The point is that f is deterministic: given the same inputs, it always produces the same output.

After training, we use y = g(x, W_trained) for inference. This function g, which is also used inside f, lets us run the model on another random variable, X. We can think of X as analogous to Z but closer to real-world test sets or actual user inputs. Y then denotes the outputs of repeated invocations of g, with each x sampled from X.

Building on this framework, we can view the whole pipeline as a Markov chain: O -> Z -> W_trained -> Y. Implicit in this chain is the assumption that X introduces no additional information into the system.

From the data processing inequality (DPI) [1] and the definition of mutual information, we can infer:

I(O;Y) ≤ I(Z;Y) ≤ I(W_trained;Y)

Interpreting this inequality: the mutual information between 'the entirety of universal knowledge' (O) and the output of our trained model (Y) can never exceed the mutual information between the dataset (Z) and the output (Y), which in turn can never exceed the mutual information between the trained weights (W_trained) and the output (Y).

In simpler terms, whatever the trained LLM generates (Y) can carry no more information about the encompassing universal knowledge (O) than its trained weights carry about that output, and no more than the training data carries about it.

So, now my question: how can an LLM generate novel insights? The same chain also gives I(O;Y) ≤ I(O;Z), so the model's outputs cannot tell us more about universal knowledge than the training data already does. Considering the data processing inequality, it seems the model shouldn't be able to produce knowledge beyond what's contained in dataset Z, unless the relevant part of universal knowledge is already present in the training data.

[1] https://en.wikipedia.org/wiki/Data_processing_inequality
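
To make the f / g framing above concrete, here is a minimal, hypothetical sketch in Python. The toy linear model, the SGD-with-momentum loop, and the names f and g simply mirror the notation in the post; this is not actual GPT training code. The only point it demonstrates is the determinism claim: fixing D, W_init, R, and S fixes W_trained, and fixing x and W_trained fixes y.

```python
# A minimal, hypothetical sketch of the f / g framing above, using a toy linear
# model trained with SGD + momentum. None of this is actual GPT code.
import numpy as np

def f(D, W_init, R, S=None):
    """W_trained = f(D, W_init, R, S): deterministic given all four inputs."""
    rng = np.random.default_rng(R)                    # R: the only source of randomness
    W = W_init.copy()
    momentum = S.copy() if S is not None else np.zeros_like(W)  # S: optimizer state
    for _ in range(1000):
        x, y = D[rng.integers(len(D))]                # draw a training pair from the dataset
        grad = 2.0 * (W @ x - y) * x                  # squared-error gradient of a linear model
        momentum = 0.9 * momentum + grad
        W -= 0.01 * momentum
    return W

def g(x, W_trained):
    """y = g(x, W_trained): inference is a pure function of input and weights."""
    return W_trained @ x

# Same D, W_init, R, S  ->  same W_trained  ->  same outputs Y for the same draws from X.
D = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
W0 = np.zeros(2)
assert np.allclose(f(D, W0, R=0), f(D, W0, R=0))
print(g(np.array([1.0, 1.0]), f(D, W0, R=0)))
```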
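
And as a quick numerical sanity check of the DPI direction used above, here is a toy Markov chain O -> Z -> Y over small finite alphabets (the probability tables are arbitrary, chosen only for illustration, and training plus inference are lumped into a single channel P(Y|Z)). Computing the mutual informations directly shows I(O;Y) ≤ I(Z;Y) and I(O;Y) ≤ I(O;Z).

```python
# Toy check of the data processing inequality on a discrete chain O -> Z -> Y.
# The distributions are made up purely for illustration.
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint distribution table p_xy[x, y]."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])))

p_o = np.array([0.5, 0.3, 0.2])                      # P(O)
p_z_given_o = np.array([[0.7, 0.2, 0.1],             # P(Z|O): a noisy "sampling" channel
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])
p_y_given_z = np.array([[0.9, 0.1],                  # P(Y|Z): training + inference lumped together
                        [0.5, 0.5],
                        [0.1, 0.9]])

p_oz = p_o[:, None] * p_z_given_o                    # joint P(O,Z)
p_zy = p_oz.sum(axis=0)[:, None] * p_y_given_z       # joint P(Z,Y)
p_oy = p_oz @ p_y_given_z                            # joint P(O,Y), using the Markov property

print(mutual_information(p_oy), "<=", mutual_information(p_zy))  # I(O;Y) <= I(Z;Y)
print(mutual_information(p_oy), "<=", mutual_information(p_oz))  # I(O;Y) <= I(O;Z)
```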