TechEcho
Automate any GUI using screenshots

197 points by ciudilo over 15 years ago

25 comments

Corrado over 15 years ago
A lot of folks on here are saying that this is cool but useless because there are better ways to click a button on a screen. If you read through their paper (http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf) you'll find more practical examples of what can be done with this type of system.

One such example is tracking real-time images from a webcam pointed at a baby, using Sikuli to watch for a yellow dot placed on the baby's forehead. Another is tracking the movement of something across the screen; in this case, a bus moving along Google Maps.

I agree that there are better ways to do most of the things in their examples and that they should probably rework their videos a bit, but just because this system doesn't solve your problems the way you want it to doesn't mean it's useless.
tbrownaw over 15 years ago
Hm, makes me think of http://blog.objectmentor.com/articles/2008/06/22/observations-on-test-driving-user-interfaces .

It sounds... scary. Like it will work well enough at first, and then explode when someone changes their desktop theme (especially icon theme), or wants to upgrade to a new version of whatever. Treating things as change-controlled APIs when they aren't just seems dangerous. Still, I guess there's at least some amount of change control coming from platform conventions and human interface guidelines, and this comes closer to operating at the correct level of abstraction to benefit from that.
liuliu over 15 years ago
Sometimes I see HN fail to collaboratively discover this kind of interesting topic. I posted the paper much earlier than the media coverage (more than 100 days ago): http://news.ycombinator.com/item?id=810986 and it got no upvotes.
vdm over 15 years ago
The people dissing this have obviously never dreamed of automating a 16-bit Visual Basic 3 Windows app (that's Win16, not Win32) so it can be run from a webapp front-end and gradually obsolesced.

AutoHotkey works, but matching by screenshots with computer vision would cut the amount of work required in half.

Bravo!
nex3 over 15 years ago
I can imagine this being useful for knowing to stop when things start going wrong. One problem I've had with GUI-automators in the past is that they've just kept automating after something unexpected happened and put them into an invalid state. It seems like Sikuli could avoid this by literally knowing when the screen looks wrong.
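
A minimal sketch of that "stop when the screen looks wrong" idea, in plain Python rather than Sikuli's own API. The names `run_steps`, `UnexpectedScreen`, and the `looks_right` callbacks are hypothetical; in a real script each check would presumably be an image lookup against the screen, and here a dictionary stands in for the screen.

```python
class UnexpectedScreen(Exception):
    """Raised when the screen does not look the way a step expects."""


def run_steps(steps):
    """Run (name, action, looks_right) triples, halting at the first bad screen.

    action() performs one GUI step; looks_right() reports whether the
    screen afterwards matches what the script expects. Both are
    hypothetical stand-ins for real clicks and image lookups.
    """
    completed = []
    for name, action, looks_right in steps:
        action()
        if not looks_right():
            raise UnexpectedScreen(f"screen looks wrong after step {name!r}")
        completed.append(name)
    return completed


# Simulated run: the second step leaves the "screen" in an unexpected state.
screen = {"dialog": None}
steps = [
    ("open settings", lambda: screen.update(dialog="settings"),
     lambda: screen["dialog"] == "settings"),
    ("open network", lambda: screen.update(dialog="error"),
     lambda: screen["dialog"] == "network"),
    ("set ip", lambda: screen.update(dialog="done"),
     lambda: screen["dialog"] == "done"),
]

try:
    run_steps(steps)
    stopped_at = None
except UnexpectedScreen as exc:
    stopped_at = str(exc)
```

The third step never runs: the script halts as soon as a check fails instead of continuing to automate from an invalid state.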
jrockway over 15 years ago
Rube Goldberg would be proud.

(Sometimes you should take a step back and ask yourself, "is looking for pictures on the screen *really* the best way to do this?" The example they show on the main page is a one-line "ifconfig" invocation, for example.)
sacrilicious over 15 years ago
Two wonderful things about this:

1. As frankenstein-ed together as the tech is, it works*
2. This is arguably more natural than 'workflow' recording functionality like Automator, and I found the actual 'code' highly readable (although inscrutably hard to debug or test or _run_ without the IDE...)

All in all I love the way the idea works right now, although Java feels less than elegant on the Mac.

*(er, although for me it's got a killer bug - using the hotkey to make a screenshot does not work, gives no option of cancelling... hardcore crasher in my book)
actf over 15 years ago
This tool looks really interesting - and I love the idea that it can be programmed using Python. I've used a number of GUI automation tools in the past, like AutoHotkey (which I can also highly recommend) - this one looks like it would make it easier to do certain tasks that are difficult in AutoHotkey, for example: interacting with webpages or other applications that don't have standard interfaces that can be examined with system APIs.

The screenshot approach this tool takes is unusual. My only criticism is that, judging by the video, the image processing approach seems slow compared to an AutoHotkey script.

What I'm really waiting for is a tool that can take this one step further and do OCR on any on-screen text. This would make it easy to interact with GUIs that present text that can't be read using system APIs - IMHO that would be the holy grail of GUI automation.
mlapeter over 15 years ago
If someone could take this concept a step further and let you create a self-contained process that users could download and run just by clicking (like tasks in Photoshop), I could see some uses:

- Some tech support situations where you have to have a user do some number of steps on their computer that are the same for all users. Sort of like an automated Geek Squad.

- Sell a prepackaged GTD-style organization system that creates all the folders for you in the right places, downloads files (a pre-made budget spreadsheet, for example) into them, etc. (trivial, but it's a pain point for people)

- Make a bunch of different productivity apps that mimic the steps a professional programmer/photographer/marketer etc. does when they first set up a new computer (bookmarks, preference settings, etc.)
nodogbite over 15 years ago
Clearly Sikuli has flaws, but for a research project, their presentation and execution are impressive. Their efforts should be commended. Hopefully they'll continue enhancing their scripting environment so that the scripts are robust to significant variation in the GUI.
thorax over 15 years ago
Very cool, but it would have major limitations outside of just making a "personal script" or, at best, a script for a heavily locked-down enterprise/academic setup.

Because it uses literal images, it seems like any change in OS theme, OS version, app version, localization (e.g. text or control shape), or colors (e.g. high contrast mode) would break the scripts.

It'd be neat to use for GUI automation during software development except for the fact that the GUI changes, the button wordings are tweaked, etc.

In all of these cases, back-end or OS GUI automation is probably better, but if you have an unchanging environment or want a quick on-the-fly test, the screenshot approach is novel and probably a bit cooler.
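
To make the fragility concern concrete, here is a toy sketch of image matching with and without a tolerance. It is not how Sikuli matches images; it is a brute-force comparison over small grayscale grids, and the names `find`, `mean_abs_diff`, and `tolerance` are made up for illustration.

```python
def mean_abs_diff(screen, template, top, left):
    """Average per-pixel difference between template and a screen patch."""
    total = 0
    for r, row in enumerate(template):
        for c, value in enumerate(row):
            total += abs(screen[top + r][left + c] - value)
    return total / (len(template) * len(template[0]))


def find(screen, template, tolerance=0.0):
    """Return (row, col) of the best match, or None if nothing is close enough.

    screen and template are rectangular lists of grayscale values (0-255).
    tolerance is the largest mean absolute difference still accepted.
    """
    best = None
    for top in range(len(screen) - len(template) + 1):
        for left in range(len(screen[0]) - len(template[0]) + 1):
            score = mean_abs_diff(screen, template, top, left)
            if score <= tolerance and (best is None or score < best[0]):
                best = (score, (top, left))
    return None if best is None else best[1]


button = [[200, 200], [200, 50]]

# The same button drawn at row 1, col 2 of a small dark "screen".
screen = [
    [0, 0, 0, 0, 0],
    [0, 0, 200, 200, 0],
    [0, 0, 200, 50, 0],
    [0, 0, 0, 0, 0],
]

# A "re-themed" screen: every button pixel is 20 levels brighter.
themed = [[min(255, v + 20) if v else 0 for v in row] for row in screen]

exact = find(screen, button)                     # exact match succeeds
broken = find(themed, button)                    # exact match fails after re-theme
tolerant = find(themed, button, tolerance=25.0)  # fuzzy match still succeeds
```

A literal pixel-for-pixel comparison breaks on the re-themed screen, while a modest tolerance still finds the button. A large enough visual change (new icon, different language) would defeat the tolerance too.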
onyrac over 15 years ago
Agreed, the demo's objective is silly, but there are problems that are hard to solve without GUI automation. For example, this tool could be great for scraping Flash-based websites, which are notoriously painful to automate. And the integration with Python means that you can easily mix and match with conditional statements, calls to OCR libraries, etc...
amjith over 15 years ago
This is a much nicer and more intuitive alternative to http://autohotkey.com on Windows. I've tried introducing AutoHotkey at work to automate some of the mundane tasks, but the learning curve was too steep for most of my co-workers. I'm going to introduce this at my workplace.
dpcan over 15 years ago
If you skip to the last 30 seconds of the first SIX MINUTE video tutorial, you can see the app in action. Otherwise, you have to sit through a whole class on how to use the app before you even know if you want to use it.

Little lesson in creating a good video demo....

Get to the point.

Then provide more videos for details.

(I guess you could say this should be expected from an MIT project website)
timf over 15 years ago
cf. http://news.ycombinator.com/item?id=1069608
kenshi over 15 years ago
It looks like a more advanced version of tools like Quick Test Pro.

There is big money in tools like that, but I can tell you, it's a real PITA to write test scripts using tools such as these. Given the option, you are better off exposing your app's object model to a scripting language, and letting testers script it like that.

Obviously that doesn't work for third-party or legacy apps. So it definitely has a market. And their computer vision algorithms have to be better than the godawful bitmap comparison tools that QTP used.
subbu over 15 years ago
The best use case I can think of for this is writing automated test cases for a browser-based app. Selenium does a pretty good job of that already.

The demo (automatically setting an IP) is a one-time job. How many times do we have to do this task? So there is no need for me to automate that kind of job. But having said that, this could still be useful in some use cases. One example I could think of is testing desktop apps.
ststrat over 15 years ago
This is incredibly useful. That's why Redstone Software has been selling it for years, under the name Eggplant - see http://www.testplant.com/products/eggplant_functional_tester . It takes a lot of work in QA to figure out why this is useful (back me up on this one, experienced QA engineers) and the right way to do it, so I'll give you the Cliff Notes: this sort of bitmap recognition lets you automate that "last mile" QA groups can never seem to automate. AutoHotkey, Selenium, and other things all help automate lots of aspects of the interface, with tons of caveats and gotchas. This is a much more useful, if less pleasingly elegant, solution.

When you are automating testing it's relatively easy to automate back-end stuff, write unit tests, write scripts wrapping CLI interfaces and so on, but every automation team that deals with GUIs eventually stubs their toe on automating the user interface. By having the computer automate the GUI task in the same way a human user executes it ("I want to click the Apple Menu - Where is the Apple icon I know is on top of the Apple Menu? - Ah! There it is! I'll click it") you make it easier, or even possible, for the people writing the QA automation to automate the GUI in a reasonable amount of time.

There are some pitfalls. What if someone changes the theme on the automation rig? Well, you're an engineering team, not a preschool - DON'T change the theme! What if somebody changes an icon in the app you're testing? Fortunately you have access to the bitmap (it's saved with the rest of the build files, yeah?) and of course the change notes for the build tell you the icon has been updated. Well, of course it isn't in the change notes, but when a test that was working fails you can easily run to the point where it says "Can't find the foo button." This is a hint to look for the foo button and think about why it can't be found.

Finally, all good scripting languages have an escape hatch to call other programs that can do things better than they can and return a result. Need to check an old COM object through its native interface? Write a small Windows app that your script calls to get that state. It takes a lot of experience and frustration with trying to fully automate tests on a GUI to understand why this is useful. And the cry of "Bitmaps break because things change" - well, no they don't. Not on a computer. Not if you know what you're doing and have control of the source. (Please disable all auto-update systems on your test rig or you will be surprised at some point.)
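
The "escape hatch" mentioned above can be sketched with Python's standard `subprocess` module. This is only an illustration under stated assumptions: `ask_helper` is a made-up name, and a Python one-liner stands in for the hypothetical small Windows app that would query the real component.

```python
import subprocess
import sys


def ask_helper(args):
    """Run an external helper program and return its trimmed stdout.

    Raises subprocess.CalledProcessError if the helper exits non-zero.
    """
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Hypothetical helper: a tiny Python one-liner stands in for the
# "small Windows app" that would report the component's state.
helper = [sys.executable, "-c", "print('state=ready')"]
state = ask_helper(helper)
```

The test script only needs the helper's printed result, so the helper can be written in whatever language talks to the component most easily.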
vdm over 15 years ago
How often do you have to read and re-type an error message into Google, because the text cannot be copied and pasted? This technology could OCR the screen text and Google it for you automatically.

The demo video is a proof of concept; make sure you read the paper.

http://sikuli.csail.mit.edu/documentation.shtml
RK over 15 years ago
I've had to use some non-scriptable, proprietary software that this might actually be useful for in doing repetitive tasks. This is especially true at some places where I have done some engineering consulting (non-software). It would probably fall in the category of ugly hack, but would also save some headache for me.
sebastian over 15 years ago
Does anyone know if a Sikuli script can be run from the command line without having to use the Sikuli IDE?
ideamonk over 15 years ago
Sikuli comes at the right time for me, going to use it to automate browser testing and generate reports :)
amichail over 15 years ago
Here's a more ambitious vision of this idea published in 2000:

http://www.cs.washington.edu/homes/lsz/papers/slpz-cacm00.pdf
pwim over 15 years ago
How well would this work for game-playing bots? If this can abstract away the detection and clicking of regions, it would make building one much more approachable.
dirtbox over 15 years ago
I wonder if I can teach it how to leet at Team Fortress.