>The findings reveal that even slight changes in the phrasing of questions can cause major discrepancies in model performance that can undermine their reliability in scenarios requiring logical consistency.<p>Relevant conversation with Yann Lecun:<p><a href="https://www.youtube.com/watch?v=5t1vTLU7s40&t=4189s" rel="nofollow">https://www.youtube.com/watch?v=5t1vTLU7s40&t=4189s</a>