A lot of folks on here are saying that this is cool but useless because there are better ways to click a button on a screen. If you read through their paper (<a href="http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf" rel="nofollow">http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2...</a>) you'll find more practical examples of what can be done with this type of system.<p>One such example is tracking real-time images from a webcam pointed at a baby, using Sikuli to watch for a yellow dot placed on the baby's forehead. Another is tracking the movement of something across the screen; in this case, a bus moving along Google Maps.<p>I agree that there are better ways to do most of the things in their examples and that they should probably re-work their videos a bit, but just because this system doesn't solve your problems the way you want it to doesn't mean it's useless.
Hm, makes me think of <a href="http://blog.objectmentor.com/articles/2008/06/22/observations-on-test-driving-user-interfaces" rel="nofollow">http://blog.objectmentor.com/articles/2008/06/22/observation...</a> .<p>It sounds... scary. Like it will work well enough at first, and then explode when someone changes their desktop theme (especially icon theme), or wants to upgrade to a new version of whatever. Treating things as change-controlled APIs when they aren't just seems dangerous. Still, I guess there's at least some amount of change control coming from platform conventions and human interface guidelines, and this comes closer to operating at the correct level of abstraction to benefit from that.
Sometimes I see HN fail to collaboratively discover interesting topics like this one. I posted the paper well before the media coverage (more than 100 days ago): <a href="http://news.ycombinator.com/item?id=810986" rel="nofollow">http://news.ycombinator.com/item?id=810986</a> and it didn't get a single upvote.
The people dissing this have obviously never dreamed of automating a 16-bit Visual Basic 3 Windows app (that's Win16, not Win32) so it can be run from a webapp front-end and gradually obsolesced.<p>AutoHotkey works, but matching by screenshots with computer vision would cut the amount of work required in half.<p>Bravo!
I can imagine this being useful for knowing to stop when things start going wrong. One problem I've had with GUI-automators in the past is that they've just kept automating after something unexpected happened and put them into an invalid state. It seems like Sikuli could avoid this by literally knowing when the screen looks wrong.
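A minimal sketch of that guard in Sikuli's Python scripting (the image names are hypothetical screenshots you'd capture yourself):

    click("save_button.png")
    # stop immediately if the app throws an unexpected dialog,
    # instead of blindly automating on in an invalid state
    if exists("error_dialog.png", 2):  # scan for up to 2 seconds
        raise Exception("unexpected error dialog on screen, aborting")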
Rube Goldberg would be proud.<p>(Sometimes you should take a step back and ask yourself, "is looking for pictures on the screen <i>really</i> the best way to do this?" The example they show on the main page is a one-line "ifconfig" invocation, for example.)
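For the record, the demo's entire GUI workflow is roughly this (interface name and addresses made up):

    import subprocess
    # the same network change as the GUI demo, as one command
    subprocess.call(["sudo", "ifconfig", "en0", "inet",
                     "192.168.1.20", "netmask", "255.255.255.0"])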
Two wonderful things about this:
1. as frankenstein-ed together as the tech is, it works*
2. this is arguably more natural than 'workflow' recording functionality like Automator, and I found the actual 'code' highly readable (although inscrutably hard to debug, test, or _run_ without the IDE...)<p>All in all I love the way the idea works right now, although Java feels less than elegant on the Mac.<p>*(er, although for me it's got a killer bug - using the hotkey to take a screenshot doesn't work and gives no option of cancelling... a hardcore crasher in my book)
This tool looks really interesting - and I love the idea that it can be programmed using Python. I've used a number of GUI automation tools in the past like AutoHotkey (which I can also highly recommend) - this one looks like it would make it easier to do certain tasks that are difficult in AutoHotkey, for example interacting with webpages or other applications that don't have standard interfaces that can be examined through system APIs.<p>The screenshot approach this tool takes is quite novel. My only criticism is that, judging by the video, the image processing seems slow compared to an AutoHotkey script.<p>What I'm really waiting for is a tool that takes this one step further and does OCR on any on-screen text. That would make it easy to interact with GUIs that present text that can't be read through system APIs - imho that would be the holy grail of GUI automation.
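You can almost bolt that on yourself today; a rough sketch as a standalone CPython script (Sikuli itself runs on Jython, and pytesseract and Pillow are my assumptions here, not part of Sikuli):

    import pytesseract
    from PIL import ImageGrab

    # grab the whole screen and OCR any visible text
    screen = ImageGrab.grab()
    print(pytesseract.image_to_string(screen))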
If someone could take this concept a step further and let you create a self-contained process that users could download and run just by clicking (like actions in Photoshop), I could see some uses:<p>- Some tech support situations where you have to walk a user through a set of steps on their computer that are the same for all users. Sort of like an automated Geek Squad.<p>- Sell a prepackaged GTD-style organization system that creates all the folders for you in the right places, downloads files (a pre-made budget spreadsheet, for example) into them, etc. (trivial, but it's a pain point for people)<p>- Make a bunch of different productivity apps that mimic the steps a professional programmer/photographer/marketer etc. takes when they first set up a new computer (bookmarks, preference settings, etc.)
Clearly Sikuli has flaws, but for a research project, their presentation and execution are impressive. Their efforts should be commended. Hopefully they'll continue enhancing their scripting environment so that the scripts are robust to significant variation in the GUI.
Very cool, but it would have major limitations outside of just making a "personal script" or, at best, a script for a heavily locked-down enterprise/academic setup.<p>Because it uses literal images, it seems like any change in OS theme, OS version, app version, localization (e.g. text or control shape), or colors (e.g. high-contrast mode) would break the scripts.<p>It'd be neat to use for GUI automation during software development, except that the GUI changes, button wordings get tweaked, etc.<p>In all of these cases, back-end or OS-level GUI automation is probably better, but if you have an unchanging environment or want a quick on-the-fly test, the screenshot approach is novel and probably a bit cooler.
Agreed, the demo task is silly, but there are problems that are hard to solve without GUI automation. For example, this tool could be great for scraping Flash-based websites, which are notoriously painful to automate. And the integration with Python means that you can easily mix and match with conditional statements, calls to OCR libraries, etc...
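For instance, a scraping loop might look something like this sketch (image names are hypothetical screenshots):

    # page through a Flash gallery, branching on what's actually on screen
    while exists("next_button.png", 0):      # 0 = scan once, don't block
        if exists("error_banner.png", 0):    # bail out if the site errors
            break
        # ...scrape or OCR the visible region here...
        click("next_button.png")
        wait(1)                              # let the Flash page redraw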
This is a much nicer and more intuitive alternative to <a href="http://autohotkey.com" rel="nofollow">http://autohotkey.com</a> on Windows. I've tried introducing AutoHotkey at work to automate some of the mundane tasks, but its learning curve was too steep for most of my co-workers. I'm going to introduce this at my workplace instead.
If you skip to the last 30 seconds of the first SIX MINUTE video tutorial, you can see the app in action. Otherwise, you have to sit through a whole class on how to use the app before you even know if you want to use it.<p>Little lesson in creating a good video demo....<p>Get to the point.<p>Then provide more videos for details.<p>(I guess you could say this should be expected from an MIT project website)
It looks like a more advanced version of tools like QuickTest Pro.<p>There is big money in tools like that, but I can tell you, it's a real PITA to write test scripts using them. Given the option, you are better off exposing your app's object model to a scripting language and letting testers script against that.<p>Obviously that doesn't work for third-party or legacy apps, so it definitely has a market. And their computer vision algorithms have to be better than the godawful bitmap-comparison tools that QTP used.
The best use case I can think of for this is writing automated test cases for a browser-based app, though Selenium already does a pretty good job of that.<p>The demo (automatically setting an IP) is a one-time job - how many times do we have to do that task? So there's no need for me to automate those kinds of jobs. Having said that, this could still be useful in some cases; one example I can think of is testing desktop apps.
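For the browser case, a few lines of Selenium (Python bindings shown; page and element names made up) already give you semantic element lookup instead of pixel matching:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://example.com/login")
    # elements are located by DOM id, so theme or layout changes don't break it
    driver.find_element(By.ID, "username").send_keys("alice")
    driver.find_element(By.ID, "login-button").click()
    driver.quit()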
This is incredibly useful. That's why Redstone Software has been selling it for years, under the name Eggplant - see <a href="http://www.testplant.com/products/eggplant_functional_tester" rel="nofollow">http://www.testplant.com/products/eggplant_functional_tester</a> .
It takes a lot of work in QA to figure out why this is useful (back me up on this one, experienced QA engineers) and the right way to do it, so I'll give you the Cliff Notes:
This sort of bitmap recognition lets you automate the "last mile" QA groups can never seem to automate. AutoHotkey, Selenium, and other tools all help automate lots of aspects of the interface, with tons of caveats and gotchas. This is a much more useful, if less pleasingly elegant, solution.
When you are automating testing it's relatively easy to automate back-end stuff, write unit tests, and write scripts wrapping CLI interfaces, but every automation team that deals with GUIs eventually stubs its toe on automating the user interface. By having the computer automate a GUI task the same way a human user executes it ("I want to click the Apple menu - where is the Apple icon I know is on top of the Apple menu? - Ah! There it is! I'll click it") you make it easier, or even possible, for the people writing the QA automation to automate the GUI in a reasonable amount of time.
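In Sikuli, that whole "where's the icon? Ah, there it is, click it" sequence is literally a couple of lines (apple_icon.png being a screenshot you capture once):

    wait("apple_icon.png", 10)   # watch the screen until the icon appears
    click("apple_icon.png")      # click wherever the match was found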
There are some pitfalls. What if someone changes the theme on the automation rig? Well, you're an engineering team, not a preschool - DON'T change the theme!
What if somebody changes an icon in the app you're testing? Fortunately you have access to the bitmap (it's saved with the rest of the build files, yeah?) and of course the change notes for the build tell you the icon has been updated. Well, of course it isn't in the change notes, but when a test that was working fails, you can easily run to the point where it says "Can't find the foo button." That's a hint to look for the foo button and think about why it can't be found.
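A sketch of making that failure explicit instead of letting the script click at a stale location (image name hypothetical):

    # fail with a clear message rather than clicking blindly
    if not exists("foo_button.png", 5):   # give the UI 5 seconds to settle
        raise Exception("Can't find the foo button - icon changed?")
    click("foo_button.png")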
Finally, all good scripting languages have an escape hatch to call other programs that can do things better than they can and return a result. Need to check an old COM object through its native interface? Write a small Windows app that your script calls to get that state.
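Since Sikuli scripts are Python (Jython under the hood), that escape hatch is just subprocess; a sketch with a made-up helper app:

    import subprocess

    # hypothetical helper .exe that checks the COM object through its
    # native interface and prints its state to stdout
    proc = subprocess.Popen(["check_com_state.exe", "Foo.Bar"],
                            stdout=subprocess.PIPE)
    state = proc.communicate()[0].strip()
    if state != "OK":
        raise Exception("COM object in a bad state: " + state)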
It takes a lot of experience and frustration with trying to fully automate tests on a GUI to understand why this is useful. And as for the cry of "bitmaps break because things change" - well, no they don't. Not on a computer. Not if you know what you're doing and have control of the source. (Please disable all auto-update systems on your test rig, or you will be surprised at some point.)
How often do you have to read and re-type an error message into Google because the text can't be copied and pasted? This technology could OCR the screen text and Google it for you automatically.<p>The demo video is a proof of concept; make sure you read the paper.<p><a href="http://sikuli.csail.mit.edu/documentation.shtml" rel="nofollow">http://sikuli.csail.mit.edu/documentation.shtml</a>
I've had to use some non-scriptable, proprietary software that this might actually be useful for automating repetitive tasks in. That's especially true at some places where I've done (non-software) engineering consulting. It would probably fall in the category of ugly hack, but it would also save me some headache.
Here's a more ambitious vision of this idea published in 2000:<p><a href="http://www.cs.washington.edu/homes/lsz/papers/slpz-cacm00.pdf" rel="nofollow">http://www.cs.washington.edu/homes/lsz/papers/slpz-cacm00.pd...</a>
How well would this work for game-playing bots? If it can abstract away the detection and clicking of regions, it would make building one much more approachable.
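Something like this hypothetical Sikuli loop would be the core of such a bot (target.png being a screenshot of whatever it should click):

    # scan-and-click loop: the vision side is entirely abstracted away
    while True:
        if exists("target.png", 0):   # 0 = single scan, no waiting
            click("target.png")       # click the best on-screen match
        wait(0.5)                     # throttle the scan loop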