I have a couple of projects at my company where we are building AI agents that generate code and/or help people design software. The agents themselves are conversational, and the generated code is most often UI code.<p>How are people evaluating the responses of AI agents these days? Conversational flows seem particularly hard, because judging a response may require keeping the entire conversation in context.<p>Any help or resources would be much appreciated!