Suggestion for the OP: read more about computer vision.<p>Extracting gestures is indeed a problem. Most of the approaches I know depend on a state triggered by the appearance of a new input (in the video, when you add or remove a finger) and then work by doing a temporal sum of the movement to get a shape.<p>This of course introduces problems with how fast or slow the person draws the shape in the air - unless you trigger the reset when a finger is added, a finger is removed (as explained before), <i>OR</i> when you have just successfully detected a gesture. I don't mean identified it: a quick deceleration of the finger followed by a short immobilization can reset the "frame of reading" (see the rough sketch at the end of this comment).<p>You may or may not have successfully grasped what came before that shape, but a human will usually stop and try again, so you get to join the right "frame of reading".<p>I've done a little work (computer vision MA thesis) on using Gestalt perceptual grouping on 3D+t (video) imaging. The goal was automating sign language interpretation (especially when shapes are drawn in the air, something very common in French Sign Language - and therefore, I suppose, in American Sign Language, considering how close they are linguistically).<p>However, we were far from that in 2003, and we used webcams only. A lot of work went into separating each finger, which depends among other things on its position relative to the others: at the extremity of the row you have either the index or the pinky, and you can guess which one if you know which hand it is and which side is facing the camera.<p>I don't think it is, or even was, <i>that</i> innovative. I've stopped working on it, so I guess there must have been a lot of new, innovative approaches since. So once again, go read more about computer vision. It's fascinating!<p>I'd be happy to send anyone a copy, but it's in French :-)
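<p>To make the "frame of reading" idea concrete, here is a rough sketch (not code from the thesis; the fingertip-tracking input and all thresholds are made up for illustration) of segmenting a tracked fingertip trajectory into shapes, resetting when a finger is added or removed, or after a quick deceleration followed by a short immobilization:<p>

    # Rough sketch only - assumes you already track one fingertip per frame
    # and count the visible fingers; every threshold below is invented.
    import math

    SPEED_STOP = 20.0     # px/s under which the finger counts as immobile (assumed)
    DECEL_MIN = 2000.0    # px/s^2 drop counting as a "quick deceleration" (assumed)
    STOP_FRAMES = 6       # consecutive slow frames ~ a "short immobilization" (assumed)

    def segment_gestures(samples, fps=30.0):
        """samples: iterable of (x, y, n_fingers), one per video frame.
        Yields one list of (x, y) points per 'frame of reading'."""
        shape, prev, prev_speed = [], None, 0.0
        decelerated, slow_frames = False, 0
        for x, y, n_fingers in samples:
            if prev is not None:
                px, py, pn = prev
                speed = math.hypot(x - px, y - py) * fps   # px/s
                decel = (prev_speed - speed) * fps         # px/s^2
                if decel > DECEL_MIN:
                    decelerated = True
                slow_frames = slow_frames + 1 if speed < SPEED_STOP else 0
                # Reset the frame of reading when a finger appears/disappears,
                # or after a quick deceleration followed by a short stop.
                if n_fingers != pn or (decelerated and slow_frames >= STOP_FRAMES):
                    if shape:
                        yield shape   # the temporal sum of the movement = the shape
                    shape, decelerated, slow_frames = [], False, 0
                prev_speed = speed
            shape.append((x, y))
            prev = (x, y, n_fingers)
        if shape:
            yield shape

<p>In practice you would probably want to smooth the speed over a few frames before thresholding, otherwise webcam jitter alone will trigger resets constantly.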