Great article!

This is one of the most overlooked problems in generative AI. It seems so trivial, but in fact it is quite difficult. The difficulty arises from the non-linearity that is expected in any natural motion.

In fact, the author highlights the difficulties of this problem much better than I could.

I started with a simple implementation that tried to move segments around the image using a segmentation mask + ROI. That strategy didn't work out, probably because of a mathematical bug or insufficient data. I suspect the latter.

The whole idea was to draw a segmentation mask on the target image, then draw lines that represent motion, and give the option to insert keyframes along the lines.

Imagine you are drawing a curve from A to B. You divide the curve into A, A_1, A_2, ..., B.

Now, given the segmentation mask, the motion curve, and the whole image as input, we train some model to move only the ROI according to the motion curve and keyframes.

The problem with this approach is in sampling the keyframes and maintaining consistency (making sure the ROI represents the same object) across subsequent keyframes.

If we are able to solve some form of consistency, this method might give enough constraints to generate viable results.
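For concreteness, the naive rigid-translation version looks something like the sketch below (illustrative only, not actual working code; the function names and the linear resampling of the curve are my assumptions). It just slides the masked ROI along the curve, which is exactly why it breaks down: no rotation, no deformation, no object consistency.

```python
# Naively slide a masked ROI along a user-drawn motion curve.
import numpy as np

def sample_curve(points, n_frames):
    """Linearly resample a polyline of (x, y) points (at least 2) into n_frames positions."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)   # segment lengths
    t = np.concatenate([[0.0], np.cumsum(seg)])
    t /= t[-1]                                              # normalized arc length
    ts = np.linspace(0.0, 1.0, n_frames)
    x = np.interp(ts, t, points[:, 0])
    y = np.interp(ts, t, points[:, 1])
    return np.stack([x, y], axis=1)

def tween_roi(image, mask, curve_points, n_frames):
    """Yield frames where the masked ROI is translated along the curve.

    image: [H, W, 3] array, mask: [H, W] binary array marking the ROI.
    """
    positions = sample_curve(curve_points, n_frames)
    offsets = positions - positions[0]            # displacement relative to frame 0
    background = image * (1 - mask[..., None])    # crude hole where the ROI was
    for dx, dy in offsets:
        shift = (int(round(dy)), int(round(dx)))
        shifted_mask = np.roll(mask, shift, axis=(0, 1))
        shifted_roi = np.roll(image * mask[..., None], shift, axis=(0, 1))
        frame = background * (1 - shifted_mask[..., None]) + shifted_roi
        yield frame.astype(image.dtype)
```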
Great read - I learned quite a bit about this. The only quibble I had is at the end, and it's a very general layperson's view:

> But what’s even more impressive - extremely impressive - is that the system decided that the body would go up before going back down between these two poses! (Which is why it’s having trouble with the right arm in the first place! A feature matching system wouldn’t have this problem, because it wouldn’t realize that in the middle position, the body would go up, and the right arm would have to be somewhere. Struggling with things not visible in either input keyframe is a good problem to have - it’s evidence of knowing these things exist, which demonstrates quite the capabilities!).... This system clearly learned a lot about three-dimensional real-world movement behind the 2D images it’s asked to interpolate between.

I think that's an awfully strong conclusion to draw from this paper - the authors certainly don't make that claim. The "null hypothesis" should be that most generative video models have a ton of yoga instruction videos shot very similarly to the example shown, and here the AI is simply repeating similar frames from similar videos. Since this most likely wouldn't generalize to yoga videos shot at a skew angle, it's hard to conclude that the system learned *anything* about 3D real-world movement. Maybe it did! But the study authors didn't come to that conclusion, and since their technique is actually model-independent, they wouldn't be in a good position to check for data contamination / overfitting / etc. The authors seem to think the value of their work is that generative video AI is by default past->future but you can do future->past without changing the underlying model, and use that to smooth out issues in interpolation. I just don't think there's any rational basis for generalizing this to understanding 3D space itself.

This isn't a criticism of the paper - the work seems clever but the paper is not very detailed and they haven't released the code yet. And my complaint is only a minor editorial comment on an otherwise excellent writeup. But I am wondering if the author might have been bedazzled by a few impressive results.
The results are actually shockingly bad, considering that I think this should be _easier_ than producing a realistic image from scratch, which AI does quite well.

I don't have more than a fuzzy idea of how to implement this, but it seems to me that keyframes _should_ be interchangeable with in-between frames, so you want to train it so that if you start with keyframes and generate in-between frames, and then run the in-between frames back through the AI, it should regenerate the keyframes.
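Something like a cycle-consistency loss, I suppose. A rough sketch of the idea (the `model(frame_a, frame_b)` interface, predicting the frame halfway between its two inputs, is an assumption):

```python
# Cycle-consistency sketch: interpolating the generated in-betweens should
# reconstruct the original middle keyframe.
import torch

def cycle_consistency_loss(model, k0, k1, k2):
    """k0, k1, k2: three consecutive keyframes, tensors of shape [B, C, H, W]."""
    mid_01 = model(k0, k1)           # in-between of k0 and k1
    mid_12 = model(k1, k2)           # in-between of k1 and k2
    k1_rec = model(mid_01, mid_12)   # should land back on the middle keyframe
    return torch.nn.functional.l1_loss(k1_rec, k1)
```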
A blog post from the same guy that used to maintain the C++ FQA!

https://yosefk.com/c++fqa/
The state of the art in 3D pose estimation and pose transfer from video seems to be pretty accurate. I wonder if another approach might be to infer a wireframe model for the character, then tween that instead of the character itself. It would be like the vector approach described in the article but with far fewer vertices; once you have the tween, use something similar to pose transfer to map the character's appearance from the nearest keyframe onto each pose.

Training on a wireframe model seems like it would be easier, since there are plenty of wireframe animations out there (at least for humans); you could take those, remove the in-between frames, and try to infer them.
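The tweening step itself could be as simple as interpolating joint positions. A sketch (the joint layout and the plain linear interpolation are assumptions; a real system would probably interpolate joint rotations, e.g. with slerp):

```python
# Tween a skeleton instead of pixels: interpolate joints between two keyframe poses.
import numpy as np

def tween_pose(pose_a, pose_b, n_inbetweens):
    """pose_a, pose_b: arrays of shape [n_joints, 2] (or [n_joints, 3] for 3D poses).

    Returns the in-between poses, excluding the two keyframes themselves.
    """
    ts = np.linspace(0.0, 1.0, n_inbetweens + 2)[1:-1]  # interior samples only
    return [(1.0 - t) * pose_a + t * pose_b for t in ts]

# Each in-between pose would then go to a pose-transfer model that re-renders
# the character's appearance from the nearest keyframe onto that pose.
```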
Now that's a great use for AI! Inbetweening has always been a thankless job, usually outsourced to sweatshop-like studios in Vietnam and, even recently, North Korea (Guy Delisle's "Pyongyang" comic is about his experience as a studio liaison there).

And AI has less room to hallucinate here, since it's more a kind of interpolation, even if in this short example the AI still "morphs" instead of cleanly transitioning.

The real animation work and talent are in the keyframes, not in the inbetweening.
This was really fun. It captured a lot of thinking on a topic I've been interested in for a while as well.

The discussion about converting to a vector format was an interesting diversion. I've been experimenting with using potrace (via Inkscape) to convert raster images into SVG and then using animation libraries in the browser to morph them, and this idea seems like it shares some concepts.

One of my favorite films is A Scanner Darkly, which used a technique called rotoscoping: as I recall, a combination of hand-traced animation that computers then augmented, or vice versa. It sounded similar. The Wikipedia page talks about the director Richard Linklater and also Bob Sabiston of the MIT Media Lab, who pioneered that derivative digital technique. It was fun to read that.

https://en.m.wikipedia.org/wiki/Rotoscoping

https://en.m.wikipedia.org/wiki/Bob_Sabiston
Animation has to be the most intriguing hobby I'm never planning on engaging with, so this kind of blog post is great for me.

I know hand-drawn 2D is its own beast, but what are your thoughts on using 3D datasets for handling the occlusion problem? There's so much motion-capture data out there; obviously almost none of it has the punchiness and appeal of hand-drawn 2D, but it feels like there could be something there. I haven't done any temporally-consistent image gen, just played around with Stable Diffusion for stills, but the ControlNets that make use of OpenPose are decent.

3D is on my mind here because the Spiderverse movies seemed like the first demonstration of how to really blend the two styles. I know they did some bespoke ML to help their animators out by adding those little crease lines to a face as someone smiles... pretty sure they were generating 3D splines, however, not raster data.

Anyway, I'm saving the RSS feed, hope to hear more about this in the future!
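For stills, the setup I mean is roughly this (just a sketch; the model IDs are the common Hugging Face ones, and `reference_frame.png` is a stand-in for whatever pose reference you have; the open question is keeping it consistent across frames):

```python
# Generate a still conditioned on an extracted OpenPose skeleton via ControlNet.
import torch
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Extract a stick-figure pose map from a reference frame.
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
pose_map = openpose(load_image("reference_frame.png"))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Render a character in that pose; each frame is generated independently,
# which is exactly where the temporal-consistency problem shows up.
result = pipe(
    "hand-drawn 2D character, clean line art",
    image=pose_map,
    num_inference_steps=20,
).images[0]
result.save("posed_character.png")
```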
Really cool read, I liked seeing all the examples.

I wonder if it would be beneficial to train on lots of static views of the character too, not just the animation frames, so that permanent features like the face get learned as a chunk of adjacent pixels. Then, when you go to make a walking animation, the relatively small amount of training data on moving legs, compared with the high repetition of faces, would cause only the legs to blur unpredictably while the faces stay intact. The overall result might be a clearer-looking animation.
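A tiny sketch of what I mean by that data mix (the placeholder datasets and the 3:1 weighting are made up, just to illustrate oversampling static views of the character relative to motion frames):

```python
# Oversample static character views so stable features (like the face) are seen
# far more often than any particular leg position during training.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder tensors standing in for real images (64x64 RGB):
static_views = TensorDataset(torch.randn(1000, 3, 64, 64))   # stills: model sheets, turnarounds, close-ups
motion_frames = TensorDataset(torch.randn(250, 3, 64, 64))   # frames from actual animations

dataset = ConcatDataset([static_views, motion_frames])
weights = [3.0] * len(static_views) + [1.0] * len(motion_frames)  # arbitrary 3:1 bias toward stills
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)
```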
Not sure who would fund this research? Perhaps the Procreate Dreams folks [0]. I'm sure they'd love to have a functional auto-tween feature.

[0]: https://procreate.com/dreams
Frankly I'm surprised this isn't *much* higher quality. The hard thing in the transformers era of ML is getting enough data that fits into the next-token or masked-language-modeling paradigm; in this case, however, inbetweening is exactly that task, and every hand-drawn animation in history is potential training data.

I'm not surprised that using off-the-shelf diffusion models or multi-modal transformer models trained primarily on still images would lead to this level of quality, but I *am* surprised if these results are from models trained specifically for this task on large amounts of animation data.
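To spell out the analogy: any frame sequence becomes a masked-modeling training example by hiding the interior frames and asking the model to reconstruct them. A rough sketch (the tensor shapes and the `model` interface are assumptions):

```python
# Inbetweening framed as masked modeling over frame sequences.
import torch

def make_masked_example(clip):
    """clip: tensor [T, C, H, W] of consecutive animation frames."""
    T = clip.shape[0]
    masked = clip.clone()
    masked[1:T - 1] = 0.0        # keep only the first and last frame visible
    targets = clip[1:T - 1]      # the in-betweens the model must reconstruct
    return masked, targets

def training_step(model, clip, optimizer):
    masked, targets = make_masked_example(clip)
    pred = model(masked)         # hypothetical model that fills in the masked frames
    loss = torch.nn.functional.l1_loss(pred[1:clip.shape[0] - 1], targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```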
I don't get it. We essentially figured this out for the first Matrix movie by having a bunch of cameras film an actor and then using interpolation to create a 360 shot from it.

Why can't this basic idea be applied to simple 2D animation over two decades later?