How are you planning to train a model on annotated actions performed on a page? Wouldn’t it be more practical to have a model like PIX2ACT that understands the action to be performed and carries it out?
Or are you planning to compose future actions by decomposing flows?
I don’t want to be rude, as I assume English is not the first language of the people behind this project, but surely something like ChatGPT could clean up this text.<p>Not that I love text fully generated by AI, but it’s got to be better than what they have currently.