Stupid question, but if we want models that are capable of <i>doing</i> things (agents) rather than just spitting out interesting content, why isn't anyone training them on data that represents actions?<p>Models are incredible at generating analytical / blog-ish / Stack Overflow-ish content, but suck at anything complex enough to require iteration.<p>For instance: if we want models that can handle complex projects, why don't we record the actions taken while executing complex projects and train models on those? Or if we want models that can use a browser competently, why don't we train them on screenshots + action descriptions? (Or is this what was done with o1, which is why it seems to have unprecedented capabilities?)<p>Is the problem just getting high-quality data? We've got internet dumps full of blog-ish content, but no big, easy-to-gather dumps of high-quality information about actions, or chains of actions and their effects over time.<p>(I'm sure there are tons of framing problems in this question -- sorry)
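One concrete reading of "training on actions" is behavioral cloning: record (observation, action) steps, serialize them to text, and train with the usual objective. A minimal sketch in Python; the record format and field names here are entirely hypothetical, just to make the idea tangible:<p><pre><code>  import json

  # A hypothetical trajectory record -- the schema is made up for
  # illustration, not any lab's actual training format.
  trajectory = {
      "goal": "rename a module and update its imports",
      "steps": [
          {"observation": "repo tree shows src/oldname/",
           "action": {"tool": "shell", "cmd": "grep -rl oldname src/"}},
          {"observation": "src/oldname/core.py",
           "action": {"tool": "edit", "file": "src/oldname/core.py",
                      "change": "import oldname becomes import newname"}},
      ],
      "outcome": "tests pass",
  }

  # Behavioral cloning reduces to next-token prediction: serialize the
  # trajectory and train the model to predict each action given the
  # goal and the observation/action prefix before it.
  training_example = json.dumps(trajectory)
</code></pre>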
What you're describing isn't how GPT training works. These models mostly work on <i>next token prediction</i>, without any understanding of what those tokens actually mean. That works well for generating text and images, but it doesn't lead to a reproducible set of steps.<p>I wrote an article[0] about it recently that you might enjoy.<p>[0] Something From Nothing | A Painless Approach to Understanding AI<p><a href="https://medium.com/gitconnected/something-from-nothing-d755f49d6636" rel="nofollow">https://medium.com/gitconnected/something-from-nothing-d755f...</a>
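To spell out that objective: a minimal sketch of the next-token-prediction loss in PyTorch, with toy shapes and a random tensor standing in for the model's output (nothing here reflects an actual GPT codebase):<p><pre><code>  import torch
  import torch.nn.functional as F

  # Toy sizes; a real model would compute `logits` from the input tokens.
  vocab_size, seq_len, batch = 50_000, 128, 4
  tokens = torch.randint(0, vocab_size, (batch, seq_len))
  logits = torch.randn(batch, seq_len, vocab_size)  # stand-in model output

  # Shift by one position: the prediction at position t is scored
  # against the token that actually appears at position t + 1.
  loss = F.cross_entropy(
      logits[:, :-1].reshape(-1, vocab_size),
      tokens[:, 1:].reshape(-1),
  )
</code></pre>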