The comments in the interpretability section read like science fiction to me. There are paragraphs on DV3 explaining other models and itself, and on the emergent properties that appear in bigger models. There is a lot of commented-out material on functional explainability and counterfactual generation.

"we asked DV3 for an explanation. DV3 replied that it detected sarcasm in the review, which it interpreted as a sign of negative sentiment. This was a surprising and reasonable explanation, since sarcasm is a subtle and subjective form of expression that can often elude human comprehension as well. However, it also revealed that DV3 had a more sensitive threshold for sarcasm detection than the human annotator, or than we expected -- thereby leading to the misspecification.

"To verify this explanation, we needed to rewrite the review to eliminate any sarcasm and see if DV3 would revise its prediction. We asked DV3 to rewrite the review to remove sarcasm based on its explanation. When we presented this new review to DV3 in a new prompt, it correctly classified it as positive sentiment, confirming that sarcasm was the cause of the specification error."

The published paper instead says "we did not test for the ability to understand sarcasm, irony, humor, or deception, which are also related to theory of mind".

The main conclusion I took away from this is "the remarkable emergence of what seems to be increasing functional explainability with increasing model scale". I can see why OpenAI decided not to publish any further details about the model's size or the steps needed to reproduce it. I had assumed we would need a much bigger model to see this level of "human" understanding from LLMs. I can respect Meta's, Google's, and OpenAI's decisions, but I hope this accelerates research into truly open source models. Interacting with these models shouldn't be locked behind corporate doors.
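
For what it's worth, the procedure in that quoted passage is essentially a counterfactual check: get a prediction, ask the model to explain it, have the model rewrite the input to remove the cited cause, then re-classify in a fresh prompt. A rough sketch of that loop, where `ask()` and the prompts are my own placeholders and not anything from the paper or a real API:

    # Hypothetical sketch of the explanation-verification loop described above.
    # ask() stands in for whatever chat/completion call was used with DV3.
    def ask(prompt: str) -> str:
        raise NotImplementedError("placeholder for a call to the model")

    def verify_explanation(review: str, expected_label: str) -> bool:
        # 1. Get the model's sentiment prediction for the original review.
        predicted = ask(
            f"Classify the sentiment of this review as positive or negative:\n{review}"
        )
        if predicted.strip().lower() == expected_label:
            return True  # no misprediction, nothing to explain

        # 2. Ask the model why it predicted that label (e.g. "I detected sarcasm").
        explanation = ask(
            f"You classified this review as {predicted}. Explain why:\n{review}"
        )

        # 3. Ask the model to rewrite the review so the cited cause is removed.
        counterfactual = ask(
            "Rewrite the following review to express the same opinion without "
            f"the feature you cited ({explanation}):\n{review}"
        )

        # 4. Re-classify the rewritten review in a fresh prompt; if the prediction
        #    now matches the expected label, the explanation is consistent with
        #    the model's behavior.
        reclassified = ask(
            f"Classify the sentiment of this review as positive or negative:\n{counterfactual}"
        )
        return reclassified.strip().lower() == expected_label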