I feel like this is the perfect application of running the data multiple times.<p>Imagine having ~10-100 different LLMs: maybe some are medical, some are general, some were trained in a different language. Have them all answer the same case, then rank the answers.<p>I believe this can be amplified further by adding another prompt that asks the model to confirm its previous answer. That could get a bit insane computationally with 100 original answers, but if I remember right, the paper I read found that repeating this confirmation prompt ~4 times got them to some 95% accuracy.<p>So 100 LLMs each give an answer, and each answer gets run through confirmation 4 times. Can that beat a 64-year-old doctor?
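To make the idea concrete, here is a rough sketch of what I mean, not how any paper actually did it. Everything in it is an assumption for illustration: `Model` is a hypothetical "prompt in, answer out" callable, the confirmation prompt wording is made up, and "rank" is just a vote count over normalized answers.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a model is any function from prompt text to answer text.
Model = Callable[[str], str]


def ask_with_confirmation(model: Model, question: str, rounds: int = 4) -> str:
    """Get an initial answer, then ask the same model to confirm or revise it."""
    answer = model(question)
    for _ in range(rounds):
        # Assumed prompt wording; a real setup would need to tune this.
        prompt = (
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            "Confirm this answer or give a corrected one."
        )
        answer = model(prompt)
    return answer


def rank_answers(
    models: Sequence[Model], question: str, rounds: int = 4
) -> List[Tuple[str, int]]:
    """Run every model through the confirmation loop and rank answers by votes."""
    finals = [
        ask_with_confirmation(m, question, rounds).strip().lower() for m in models
    ]
    # most_common() returns (answer, votes) pairs, highest vote count first.
    return Counter(finals).most_common()
```

In this version each model is called 1 + 4 = 5 times, so 100 models means 500 calls per question. Exact string matching is also a crude way to group answers; free-text diagnoses would need some kind of normalization before voting.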