In the demo I put the Obama prank photo (http://karpathy.github.io/2012/10/22/state-of-computer-vision/) and asked "Why is this picture funny?" It responded: "Question: Why is this picture funny? Answer: President Obama is taller than the average person."
I always like to try these zero-shot models on things outside of the "normal" COCO classes. Here are some chess board queries:

Counting: https://imgur.com/KTuQ1Bv

Parse the chess board: https://imgur.com/2zYFK1P

(Result): https://imgur.com/Ei4MAl7

Few-Shot Object Detection (Pascal VOC): https://imgur.com/gZkDMn8

Few-Shot Object Detection (simplified): https://imgur.com/Hk8QGMd

Not quite there yet. I've been more impressed with the other new zero-shot multimodal models like Grounding DINO and Azure Dense Captioning. Really looking forward to putting multimodal GPT-4 through its paces as well.
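For anyone wondering how the few-shot queries get fed in: a rough sketch of the interleaved image/text prompting, based on the OpenFlamingo README at the time. It assumes model, image_processor, and tokenizer have already been created with open_flamingo's create_model_and_transforms; the filenames and the in-context answer text are placeholders standing in for the chess/VOC screenshots above, not the exact prompts used.

    import torch
    from PIL import Image

    # Placeholder filenames -- in practice these were the chess board / VOC screenshots.
    demo_image = Image.open("fewshot_example.png")
    query_image = Image.open("chess_board.png")

    # Vision input shape is (batch, num_images, frames, channels, height, width);
    # frames is 1 for still images.
    vision_x = torch.cat(
        [image_processor(demo_image).unsqueeze(0), image_processor(query_image).unsqueeze(0)],
        dim=0,
    ).unsqueeze(1).unsqueeze(0)

    # Text side: each image is referenced with an <image> token and in-context
    # examples are separated with <|endofchunk|>.
    tokenizer.padding_side = "left"
    lang_x = tokenizer(
        ["<image>Output: two pawns and one rook.<|endofchunk|><image>Output:"],
        return_tensors="pt",
    )

    generated = model.generate(
        vision_x=vision_x,
        lang_x=lang_x["input_ids"],
        attention_mask=lang_x["attention_mask"],
        max_new_tokens=20,
        num_beams=3,
    )
    print(tokenizer.decode(generated[0]))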
Even at this scale the model is able to answer questions fairly impressively, but I created an image with some distinct shapes in different positions and it didn't go well [0]. I suspect that however they're doing the image encoding, it doesn't capture positional information, which, to my mind, limits a lot of use cases.

[0] https://i.postimg.cc/GtrGs8mw/Screenshot-2023-03-28-at-5-19-55-PM.png
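A minimal way to build that kind of positional probe, if anyone wants to reproduce it (this is a hypothetical reconstruction of the test, not the image in the screenshot above): draw a few distinct shapes at known positions and check whether the model's answers track the actual layout.

    from PIL import Image, ImageDraw

    # A few distinct shapes at known positions (left / center / right).
    img = Image.new("RGB", (448, 224), "white")
    draw = ImageDraw.Draw(img)
    draw.ellipse((30, 60, 130, 160), fill="red")                     # red circle on the left
    draw.rectangle((180, 60, 280, 160), fill="blue")                 # blue square in the center
    draw.polygon([(380, 60), (330, 160), (430, 160)], fill="green")  # green triangle on the right
    img.save("position_probe.png")

    # Then prompt along the lines of (hypothetical wording):
    # "<image>Question: Which shape is on the left? Answer:"
    # and see whether the answer matches the actual positions.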
This is awesome work, and they also provide their 9B OpenFlamingo model, which is based on LLaMA:

https://huggingface.co/openflamingo/OpenFlamingo-9B
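Roughly how the checkpoint is meant to be loaded, per the model card/README at the time (the LLaMA-7B weights have to be obtained separately; the paths below are placeholders):

    import torch
    from huggingface_hub import hf_hub_download
    from open_flamingo import create_model_and_transforms

    # Build the architecture: CLIP ViT-L/14 vision encoder + LLaMA-7B language
    # model, with cross-attention layers inserted every 4 decoder blocks.
    model, image_processor, tokenizer = create_model_and_transforms(
        clip_vision_encoder_path="ViT-L-14",
        clip_vision_encoder_pretrained="openai",
        lang_encoder_path="<path to llama-7b-hf>",  # placeholder path
        tokenizer_path="<path to llama-7b-hf>",     # placeholder path
        cross_attn_every_n_layers=4,
    )

    # Download and load the OpenFlamingo-9B checkpoint (only the newly trained
    # Flamingo-specific weights are in the checkpoint, hence strict=False).
    checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-9B", "checkpoint.pt")
    model.load_state_dict(torch.load(checkpoint_path), strict=False)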