A friend of mine is a solo developer who is building a large intelligent-actor platform on top of LLMs. I think his platform is overly abstract and makes a lot of LLM calls. How can one measure how much more intelligently this platform behaves than vanilla GPT-4? I am thinking of some use case that would let him show the strength of his idea without incurring a huge cost.<p>Edit: while googling I found this one (*), but I don't know what it would cost to test the platform with it.<p>(*) https://openreview.net/pdf?id=zAdUB0aCTQ