I tried multiple flavors of Llama models, and they are all quite dumb, even the 70B-parameter one. It knows about more things that the smaller models simply hallucinate when asked about, but it still cannot handle even slightly more complex tasks.

I'm also not sure about the current testing methodologies, i.e. the "passed the SAT" hype. Given that the training set already contains much of the relevant information, we should probably compare the AI's results against humans who have unlimited time and access to the required material.