I love that as soon as he writes,<p>> The plan was simple.<p>you know you're in for a funny read.<p>More seriously though, the JSON example from a vision-language model (VLM) is interesting, but it does not account for how much extrapolation (hallucination) the model will insert over time.<p>For instance, your VLM will probably start inserting details that are not visible in the image (such as the color of the team's jersey) simply because it knows the team's three-letter identifier.<p>So the reliability of the system will go down over time, and the errors probably compound if you feed some of that output into further steps in the loop.
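<p>To put a rough number on the compounding: here is a minimal sketch, assuming each step in the loop independently has some fixed chance of staying free of invented details. Both the independence assumption and the 0.95 figure are hypothetical, picked only for illustration, and `chain_reliability` is a made-up helper name, not anything from the article.

```python
def chain_reliability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain stays free of hallucinated detail.

    Assumes each step errs independently with the same probability,
    which is a simplification; real errors are likely correlated.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # 0.95 is a hypothetical per-step accuracy, not a measured value.
    for steps in (1, 5, 10, 20):
        print(steps, round(chain_reliability(0.95, steps), 3))
```

<p>Under those assumptions, a step that is right 95% of the time leaves you at about 0.60 after ten chained steps, which is the kind of decay I'd expect when hallucinated fields get fed back in as input.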