Exciting times. The philosophical ramifications of the syntax/semantics distinction are not something people think much about in the main. However, thanks to GPT et al. they soon will :)<p>More to the point, consistency will improve accuracy insofar as inconsistency is sometimes the cause of inaccuracy. However, being consistent is an extremely low bar. On a basic level, even consistency is a problem in natural language, where so much depends on usage -- it is near impossible to determine whether two sentences are actually negations of each other in the majority of possible cases. But the real problem is truth assignment to valid sentences; otherwise we could all just speak Lojban and be done with untruth forever.
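For what it's worth, the consistency the method leans on is far narrower than natural-language negation: it only requires that a statement paired with an explicit "Yes" and the same statement paired with "No" receive complementary probabilities from a learned probe. A rough LaTeX sketch of that condition as I understand it (my paraphrase, not the authors' notation; x^+ and x^- denote the contrast pair):

    % consistency: the pair should behave like a statement and its negation
    \[ p(x^+) \approx 1 - p(x^-) \]
    % plus a confidence term so the probe cannot settle on the degenerate p = 0.5 answer
    \[ \mathcal{L}(x^+, x^-) = \bigl(p(x^+) - (1 - p(x^-))\bigr)^2 + \min\bigl(p(x^+),\, p(x^-)\bigr)^2 \]

Which is exactly why it is a low bar: nothing in that objective ties p to truth, only to self-consistency over explicitly constructed negations.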
Is anyone able to provide a set of examples that produces latent knowledge and explicitly state what the latent knowledge produced is? If possible, even a basic explanation of the paper would be nice too, based on reading the other comments in the thread.<p>EDIT/Update: Just found examples from the 10 datasets starting on page 23. That said, even after reviewing these, my prior request stands. As far as I can guess at this point, this research just models responses across multiple models in a uniform way, which to me makes the claim that this method outperforms other methods questionable, given that it requires existing outputs from other models to aggregate knowledge across those models. Am I missing something?
Asked ChatGPT to explain like I’m 5. This is what it produced.<p>“Okay! Imagine that you have a big robot in your head that knows a lot about lots of different things. Sometimes, the robot might make mistakes or say things that aren't true. The proposed method is like a way to ask the robot questions and figure out what it knows, even if it says something that isn't true. We do this by looking inside the robot's head and finding patterns that make sense, like if we ask the robot if something is true and then ask if the opposite of that thing is true, the robot should say "yes" and then "no." Using this method, we can find out what the robot knows, even if it sometimes makes mistakes.”
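To make the "yes and then no" part concrete, here is a minimal sketch in Python of that kind of unsupervised probe, assuming you already have hidden states for statement/negation contrast pairs. All the names, shapes, and the random stand-in data below are mine for illustration, not the paper's code, and the exact loss may differ from what the authors use:

    import torch

    def contrast_consistency_loss(p_pos, p_neg):
        # Consistency: the probe should treat a statement and its negation as
        # complementary. Confidence: it should not collapse to always saying 0.5.
        consistency = (p_pos - (1 - p_neg)) ** 2
        confidence = torch.minimum(p_pos, p_neg) ** 2
        return (consistency + confidence).mean()

    hidden_dim = 768
    probe = torch.nn.Sequential(torch.nn.Linear(hidden_dim, 1), torch.nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    # Stand-in activations for "X? Yes" and "X? No" contrast pairs; in practice
    # these would be hidden states extracted from a frozen language model.
    h_pos = torch.randn(256, hidden_dim)
    h_neg = torch.randn(256, hidden_dim)

    for _ in range(100):
        opt.zero_grad()
        loss = contrast_consistency_loss(probe(h_pos).squeeze(-1),
                                         probe(h_neg).squeeze(-1))
        loss.backward()
        opt.step()

Note that no truth labels appear anywhere: the probe is pinned down only by the complementarity of each pair plus the confidence term, which is what lets it surface what the model "knows" without trusting what it says.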
Back when I was messing around with LSTM models, I was interested in training classifiers to find parts of the internal state that light up when the model is writing a proper name or something like that.<p>Nice to see people are doing similar things w/ transformers.<p>Truth, though, is a bit problematic. The very existence of the word is what lets "the truth is out there" open the TV series <i>The X-Files</i>; see also <i>Truth Social</i>. I'm sure there is a "truthy" neuron in there somewhere, but one aspect (not the only aspect) of truth is the evaluation of logical formulae (consider the evidence and reasoning process used in court), and once you can do that you run into the problems that Gödel warned you about, regardless of what kind of technology you use.
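For anyone who hasn't played with this, that kind of probing is just a small classifier fit on the frozen model's hidden states. A minimal sketch, assuming you have already collected activations plus "was the model writing a proper name here?" labels; the data below is a random placeholder, and nothing here is specific to LSTMs or transformers:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder activations and labels; in practice `states` would come from a
    # frozen model (one row per token or example) and `is_proper_name` from
    # whatever annotation you trust.
    rng = np.random.default_rng(0)
    states = rng.normal(size=(5000, 512))           # [examples, hidden_dim]
    is_proper_name = rng.integers(0, 2, size=5000)  # 0/1 labels

    probe = LogisticRegression(max_iter=1000)
    probe.fit(states[:4000], is_proper_name[:4000])

    # Held-out accuracy measures how linearly decodable the "proper name" signal
    # is from the hidden state; the probe's weights point at the directions
    # carrying it.
    print(probe.score(states[4000:], is_proper_name[4000:]))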
This is an important area for AI safety research; see the ELK paper, for example.<p><a href="https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge" rel="nofollow">https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s...</a><p>That paper is a bit dense, but it considers the ways a powerful AI model could be opaque or deceptive, making its latent knowledge hard to elicit. If we can confidently understand an AI’s internal knowledge/intention states, then alignment is probably tractable.