I'd be interested to hear how Llama 8B with long chain-of-thought prompts compares to GPT-4 one-shot prompts for real-world tasks.

In classification, for example, you could ask Llama 8B to reason through each possibility, rank them, rate them, make counterarguments, etc. - all in roughly the same time GPT-4 would take to output one classification without any reasoning. Which does better?
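
Something like this sketch is what I have in mind - a long reasoning prompt for the small model vs. a bare one-shot prompt for GPT-4 (assuming the `openai` Python client and a local OpenAI-compatible server such as vLLM or llama.cpp hosting Llama 8B; the endpoint, model names, and labels below are placeholders, not anything from a real benchmark):

    # Hypothetical comparison: long chain-of-thought classification with a local
    # Llama 8B vs. a one-shot label from GPT-4. Assumes a local OpenAI-compatible
    # server at http://localhost:8000/v1; adjust base_url/model names as needed.
    from openai import OpenAI

    LABELS = ["billing", "technical", "account", "other"]
    TICKET = "I was charged twice for my subscription this month."

    cot_prompt = f"""Classify the support ticket into one of {LABELS}.
    For each label: argue why it might apply, argue why it might not,
    and rate it 1-10. Then output the best label on the last line as
    `ANSWER: <label>`.

    Ticket: {TICKET}"""

    oneshot_prompt = (
        f"Classify the support ticket into one of {LABELS}. "
        f"Reply with the label only.\n\nTicket: {TICKET}"
    )

    # Llama 8B with the long reasoning prompt (local server, placeholder model name).
    llama = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    cot_reply = llama.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": cot_prompt}],
    )

    # GPT-4 with the bare one-shot prompt (uses OPENAI_API_KEY from the environment).
    gpt4 = OpenAI()
    oneshot_reply = gpt4.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": oneshot_prompt}],
    )

    # Last line of the CoT output should carry the final label.
    print(cot_reply.choices[0].message.content.splitlines()[-1])
    print(oneshot_reply.choices[0].message.content)

You'd obviously want to run this over a labeled set and compare accuracy and wall-clock time, but that's the shape of the experiment I'm curious about.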