Great article!

This is one of the most overlooked problems in generative AI. It seems so trivial, but in fact it is quite difficult. The difficulty arises from the non-linearity that is expected in any natural motion.

In fact, the author highlights the difficulties of this problem much better than I could.

I started with a simple implementation that tried to move segments around the image using a segmentation mask + ROI. That strategy didn't work out, probably because of a mathematical bug or insufficient data. I suspect the latter.

The whole idea was to draw a segmentation mask on the target image, then draw lines that represent motion, and give the option to insert keyframes along the lines.

Imagine you are drawing a curve from A to B. You divide the curve into A, A_1, A_2, ..., B.

Now, given the segmentation mask, the motion curve, and the whole image as input, we train some model to move only the ROI according to the motion curve and keyframes.

The problem with this approach is in sampling the keyframes and maintaining consistency (making sure the ROI represents the same object) across subsequent keyframes.

If we are able to solve some form of consistency, this method might give enough constraints to generate viable results.
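For concreteness, the naive rigid-translation version looks something like the sketch below (illustrative only, not actual working code; the function names and the linear resampling of the curve are my assumptions). It just slides the masked ROI along the curve, which is exactly why it breaks down: no rotation, no deformation, no object consistency.

```python
# Naively slide a masked ROI along a user-drawn motion curve.
import numpy as np

def sample_curve(points, n_frames):
    """Linearly resample a polyline of (x, y) points (at least 2) into n_frames positions."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)   # segment lengths
    t = np.concatenate([[0.0], np.cumsum(seg)])
    t /= t[-1]                                              # normalized arc length
    ts = np.linspace(0.0, 1.0, n_frames)
    x = np.interp(ts, t, points[:, 0])
    y = np.interp(ts, t, points[:, 1])
    return np.stack([x, y], axis=1)

def tween_roi(image, mask, curve_points, n_frames):
    """Yield frames where the masked ROI is translated along the curve.

    image: [H, W, 3] array, mask: [H, W] binary array marking the ROI.
    """
    positions = sample_curve(curve_points, n_frames)
    offsets = positions - positions[0]            # displacement relative to frame 0
    background = image * (1 - mask[..., None])    # crude hole where the ROI was
    for dx, dy in offsets:
        shift = (int(round(dy)), int(round(dx)))
        shifted_mask = np.roll(mask, shift, axis=(0, 1))
        shifted_roi = np.roll(image * mask[..., None], shift, axis=(0, 1))
        frame = background * (1 - shifted_mask[..., None]) + shifted_roi
        yield frame.astype(image.dtype)
```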
Great read - I learned quite a bit about this. The only quibble I had is at the end, and it's a very general layperson's view:

> But what’s even more impressive - extremely impressive - is that the system decided that the body would go up before going back down between these two poses! (Which is why it’s having trouble with the right arm in the first place! A feature matching system wouldn’t have this problem, because it wouldn’t realize that in the middle position, the body would go up, and the right arm would have to be somewhere. Struggling with things not visible in either input keyframe is a good problem to have - it’s evidence of knowing these things exist, which demonstrates quite the capabilities!).... This system clearly learned a lot about three-dimensional real-world movement behind the 2D images it’s asked to interpolate between.

I think that's an awfully strong conclusion to draw from this paper - the authors certainly don't make that claim. The "null hypothesis" should be that most generative video models have a ton of yoga instruction videos shot very similarly to the example shown, and here the AI is simply repeating similar frames from similar videos. Since this most likely wouldn't generalize to yoga videos shot at a skew angle, it's hard to conclude that the system learned *anything* about 3D real-world movement. Maybe it did! But the study authors didn't come to that conclusion, and since their technique is actually model-independent, they wouldn't be in a good position to check for data contamination / overfitting / etc. The authors seem to think the value of their work is that generative video AI is by default past->future but you can do future->past without changing the underlying model, and use that to smooth out issues in interpolation. I just don't think there's any rational basis for generalizing this to understanding 3D space itself.

This isn't a criticism of the paper - the work seems clever but the paper is not very detailed and they haven't released the code yet. And my complaint is only a minor editorial comment on an otherwise excellent writeup. But I am wondering if the author might have been bedazzled by a few impressive results.
The results are actually shockingly bad, considering that I think this should be _easier_ than producing a realistic image from scratch, which AI does quite well.

I don't have more than a fuzzy idea of how to implement this, but it seems to me that keyframes _should_ be interchangeable with in-between frames, so you want to train it so that if you start with keyframes and generate in-between frames, and then run the in-between frames back through the AI, it should regenerate the keyframes.
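Something like a cycle-consistency loss, I suppose. A rough sketch of the idea (the `model(frame_a, frame_b)` interface, predicting the frame halfway between its two inputs, is an assumption):

```python
# Cycle-consistency sketch: interpolating the generated in-betweens should
# reconstruct the original middle keyframe.
import torch

def cycle_consistency_loss(model, k0, k1, k2):
    """k0, k1, k2: three consecutive keyframes, tensors of shape [B, C, H, W]."""
    mid_01 = model(k0, k1)           # in-between of k0 and k1
    mid_12 = model(k1, k2)           # in-between of k1 and k2
    k1_rec = model(mid_01, mid_12)   # should land back on the middle keyframe
    return torch.nn.functional.l1_loss(k1_rec, k1)
```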
A blog post from the same guy that used to maintain the C++ FQA!

https://yosefk.com/c++fqa/
The state of the art in 3D pose estimation and pose transfer from video seems to be pretty accurate. I wonder if another approach might be to infer a wireframe model for the character, then tween that instead of the character itself. It would be like the vector approach described in the article but with far fewer vertices; once you have the tween, use something similar to pose transfer to map the character's appearance from the nearest keyframe onto each pose.

Training on a wireframe model seems like it would be easier, since there are plenty of wireframe animations out there (at least for humans); you could take those, remove the in-between frames, and try to infer them.
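The tweening step itself could be as simple as interpolating joint positions. A sketch (the joint layout and the plain linear interpolation are assumptions; a real system would probably interpolate joint rotations, e.g. with slerp):

```python
# Tween a skeleton instead of pixels: interpolate joints between two keyframe poses.
import numpy as np

def tween_pose(pose_a, pose_b, n_inbetweens):
    """pose_a, pose_b: arrays of shape [n_joints, 2] (or [n_joints, 3] for 3D poses).

    Returns the in-between poses, excluding the two keyframes themselves.
    """
    ts = np.linspace(0.0, 1.0, n_inbetweens + 2)[1:-1]  # interior samples only
    return [(1.0 - t) * pose_a + t * pose_b for t in ts]

# Each in-between pose would then go to a pose-transfer model that re-renders
# the character's appearance from the nearest keyframe onto that pose.
```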
Now that's a great use for AI! Inbetweening has always been a thankless job, usually outsourced to sweatshop-like studios in Vietnam and, even recently, North Korea (Guy Delisle's "Pyongyang" comic is about his experience as a studio liaison there).

And AI has less room to hallucinate here, since it's more a kind of interpolation, even if in this short example the AI still "morphs" instead of cleanly transitioning.

The real animation work and talent are in the keyframes, not in the inbetweening.
This was really fun. It captured a lot of thinking on a topic I've been interested in for a while as well.

The discussion about converting to a vector format was an interesting diversion. I've been experimenting with using potrace (via Inkscape) to convert raster images into SVG and then using animation libraries in the browser to morph them, and this idea seems like it shares some concepts.

One of my favorite films is A Scanner Darkly, which used a technique called rotoscoping: as I recall, a combination of hand-traced animation that computers then augmented, or vice versa. It sounded similar. The Wikipedia page talks about the director Richard Linklater and also Bob Sabiston of the MIT Media Lab, who pioneered that derivative digital technique. It was fun to read that.

https://en.m.wikipedia.org/wiki/Rotoscoping

https://en.m.wikipedia.org/wiki/Bob_Sabiston
Animation has to be the most intriguing hobby I'm never planning on engaging with, so this kind of blog post is great for me.

I know hand-drawn 2D is its own beast, but what are your thoughts on using 3D datasets for handling the occlusion problem? There's so much motion-capture data out there; obviously almost none of it has the punchiness and appeal of hand-drawn 2D, but it feels like there could be something there. I haven't done any temporally-consistent image gen, just played around with Stable Diffusion for stills, but the ControlNets that make use of OpenPose are decent.

3D is on my mind here because the Spiderverse movies seemed like the first demonstration of how to really blend the two styles. I know they did some bespoke ML to help their animators out by adding those little crease lines to a face as someone smiles... pretty sure they were generating 3D splines, however, not raster data.

Anyway, I'm saving the RSS feed, hope to hear more about this in the future!
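For stills, the setup I mean is roughly this (just a sketch; the model IDs are the common Hugging Face ones, and `reference_frame.png` is a stand-in for whatever pose reference you have; the open question is keeping it consistent across frames):

```python
# Generate a still conditioned on an extracted OpenPose skeleton via ControlNet.
import torch
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Extract a stick-figure pose map from a reference frame.
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
pose_map = openpose(load_image("reference_frame.png"))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Render a character in that pose; each frame is generated independently,
# which is exactly where the temporal-consistency problem shows up.
result = pipe(
    "hand-drawn 2D character, clean line art",
    image=pose_map,
    num_inference_steps=20,
).images[0]
result.save("posed_character.png")
```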
Really cool read, I liked seeing all the examples.

I wonder if it would be beneficial to train on lots of static views of the character too, not just the animation frames, so that permanent features like the face get learned as a chunk of adjacent pixels. Then, when you go to make a walking animation, the relatively small amount of training data on moving legs, compared with the high repetition of faces, would cause only the legs to blur unpredictably while the faces stay intact. The overall result might be a clearer-looking animation.
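A tiny sketch of what I mean by that data mix (the placeholder datasets and the 3:1 weighting are made up, just to illustrate oversampling static views of the character relative to motion frames):

```python
# Oversample static character views so stable features (like the face) are seen
# far more often than any particular leg position during training.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder tensors standing in for real images (64x64 RGB):
static_views = TensorDataset(torch.randn(1000, 3, 64, 64))   # stills: model sheets, turnarounds, close-ups
motion_frames = TensorDataset(torch.randn(250, 3, 64, 64))   # frames from actual animations

dataset = ConcatDataset([static_views, motion_frames])
weights = [3.0] * len(static_views) + [1.0] * len(motion_frames)  # arbitrary 3:1 bias toward stills
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)
```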
Not sure who would fund this research? Perhaps the Procreate Dreams folks [0]. I'm sure they'd love to have a functional auto-tween feature.

[0]: https://procreate.com/dreams
Frankly I'm surprised this isn't *much* higher quality. The hard thing in the transformers era of ML is getting enough data that fits into the next-token or masked-language-modeling paradigm; in this case, however, inbetweening is exactly that task, and every hand-drawn animation in history is potential training data.

I'm not surprised that using off-the-shelf diffusion models or multi-modal transformer models trained primarily on still images would lead to this level of quality, but I *am* surprised if these results are from models trained specifically for this task on large amounts of animation data.
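To spell out the analogy: any frame sequence becomes a masked-modeling training example by hiding the interior frames and asking the model to reconstruct them. A rough sketch (the tensor shapes and the `model` interface are assumptions):

```python
# Inbetweening framed as masked modeling over frame sequences.
import torch

def make_masked_example(clip):
    """clip: tensor [T, C, H, W] of consecutive animation frames."""
    T = clip.shape[0]
    masked = clip.clone()
    masked[1:T - 1] = 0.0        # keep only the first and last frame visible
    targets = clip[1:T - 1]      # the in-betweens the model must reconstruct
    return masked, targets

def training_step(model, clip, optimizer):
    masked, targets = make_masked_example(clip)
    pred = model(masked)         # hypothetical model that fills in the masked frames
    loss = torch.nn.functional.l1_loss(pred[1:clip.shape[0] - 1], targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```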
I don't get it. We essentially figured this out for the first Matrix movie by having a bunch of cameras film an actor and then using interpolation to create a 360 shot from it.

Why can't this basic idea be applied to simple 2D animation over two decades later?