The problem with all the very impressive videos on that page is that we have no idea how many attempts were made before the robot could successfully, e.g., "put strawberry into the correct bowl". In that task there are four bowls, so a random choice of bowl would be correct 25% of the time. How many times did the robot put the strawberry in, say, the bowl of apples? And that's assuming "the correct bowl" is the one with the strawberries, which is a big assumption: who says the strawberry should go with the others of its kind? How often can the robot put the strawberry in the bowl with the apples if that's what we want it to do?

Plotted results show around 50% average performance on "unseen" tasks, environments, objects, etc., which sounds a lot like success follows some kind of random distribution. That's not a great way to engender trust in the "emergent" abilities of a robotic system to generalise to unseen tasks. Blame bad statistics if you get a strawberry in the eye, or a banana in the ear.
[2023]

Some of the authors have gone on to found a startup called Physical Intelligence: https://www.physicalintelligence.company/blog/pi0
> We represent the robot actions as text strings as shown below. An example of such a string could be a sequence of robot action token numbers: “1 128 91 241 5 101 127 217”.

Training with numbers like this might be a little problematic; I have tried to fine-tune GPT-4o-mini this way with very little success (just me?).

On the other hand, I found [1] that Gemini and Molmo locate elements on screen much better than 4o.

1. https://github.com/BandarLabs/clickclickclick
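For context on what those action strings encode, here is a minimal sketch of the general idea: each dimension of a continuous action vector gets clipped to a per-dimension range, discretized into 256 integer bins, and the bin indices are written out as a space-separated string. The bin count, the [-1, 1] ranges, and the example action values below are assumptions chosen to reproduce the quoted string, not details taken from the paper.

    import numpy as np

    def action_to_token_string(action, low, high, n_bins=256):
        """Clip each action dimension to [low, high], discretize into n_bins
        integer bins, and render the bin indices as a space-separated string."""
        a = np.clip(np.asarray(action, dtype=np.float64), low, high)
        bins = np.round((a - low) / (high - low) * (n_bins - 1)).astype(int)
        return " ".join(str(b) for b in bins)

    def token_string_to_action(s, low, high, n_bins=256):
        """Invert the mapping: parse bin indices back into continuous values."""
        bins = np.array([int(t) for t in s.split()], dtype=np.float64)
        return low + bins / (n_bins - 1) * (high - low)

    # Assumed 8-D action with per-dimension range [-1, 1]; the values are
    # made up so the output lands on the bins quoted above.
    low, high = np.full(8, -1.0), np.full(8, 1.0)
    action = [-0.99, 0.0, -0.29, 0.89, -0.96, -0.21, -0.004, 0.70]
    print(action_to_token_string(action, low, high))
    # -> "1 128 91 241 5 101 127 217"

One reason fine-tuning a chat model on such strings may struggle is that the numbers are just tokens to the model; nothing in the text carries the bin geometry, so the model has to learn it from data alone.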
Impressive work. Connecting this with Nvidia's move to make robotics its next focus: is there a need for powerful compute local to the robot? Cloud latency would seem to be fine for the speed of these robotic arms.
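As a rough back-of-the-envelope check (the numbers are assumptions, not from the paper): if the policy only needs to emit actions at a few Hz, a cloud round trip of tens of milliseconds is a small slice of the per-step budget.

    # Back-of-the-envelope latency budget (all numbers are assumptions):
    control_rate_hz = 3            # assumed policy inference rate for a VLA-style model
    cloud_rtt_ms = 60              # assumed round trip to a nearby cloud region
    step_budget_ms = 1000 / control_rate_hz
    print(f"per-step budget: {step_budget_ms:.0f} ms; "
          f"cloud RTT uses {100 * cloud_rtt_ms / step_budget_ms:.0f}% of it")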