Skimming through studies like this, it strikes me that LLM inquiry is in its infancy. I’m not sure the typical tools and heuristics of quantitative science are powerful enough.<p>For instance, some questions on this particular study:<p>- Measurements and other quantities are cited here with anywhere between 2 and 5 significant figures. Is this enough precision? Can these say anything meaningful about a set of objects that differ by literally billions (if not trillions) of internal parameters?<p>- One of the prompts in the second set of experiments replaces the word “person” (from the first experiment) with the word “burglar”. This is a major change, and one that was unnecessary as far as I can tell. I don’t see any discussion of why that change was included. How should experiments control for things like this?<p>- We know that LLMs can generate fiction. How do we detect when that capability is being used, and control for it in studies of deception?<p>A lot of my concerns are similar to those I have with studies in the “soft” sciences (psychology, sociology, etc.). However, because an LLM is a “thing” - an artifact that can be measured, copied, tweaked, poked and prodded without ethical concern - we could do more with them, scientifically and quantitatively. And because it’s a “thing”, casual readers might implicitly expect a higher level of certainty when they see these paper titles.<p>(I don’t give this level of attention to all papers I come across, and I don’t follow this area in general, so maybe I’ve missed relevant research that answers some of these questions.)