This has been my experience. Foundation models have completely changed the game in ML. Previously, companies might have needed to hire ML engineers familiar with ML training, architectures, etc. to get mediocre results. Now companies can just hire a regular software engineer familiar with foundation model APIs to get excellent results. In some ways it is sad, but in other ways the results are so much better than what we achieved before.<p>My example was an image segmentation model. I managed to create a dataset of 100,000+ images and was training UNets and other advanced models on it. I always reached a good validation loss, but my data was simply not diverse enough and I faced a lot of issues in actual deployment, where the data distribution kept changing on a day-to-day basis. Then I tried DINOv2 from Meta, fine-tuned it on 4 images, and it solved the problem: it handled all the variations in lighting etc. with far higher accuracy than I had ever achieved. It makes sense: DINO was trained on 100M+ images, and I would never be able to compete with that.<p>In this case, the company still needed my expertise, because Meta just released the weights and so someone had to set up the fine-tuning pipeline. But I can imagine a fine-tuning API like OpenAI’s requiring no expertise beyond simple coding. If AI results depend on scale, it naturally follows that only a few well-funded companies will build AI that actually works, and everyone else will just use their models. The only way this trend reverses is if compute becomes so cheap and ubiquitous that everyone can achieve the necessary scale.
It's tough to judge without seeing examples of the targets and the user photos, but I'm curious whether this could be done with just old-school SIFT. If it really is exactly the same image in the corpus and on the wall, does a neural embedding model really buy you a lot? A small number of high-confidence tie points seems like it'd be all you need, but it probably depends a lot on just how challenging the user photos are.
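To make "a small number of high-confidence tie points" concrete, here is a minimal sketch of Lowe's ratio test, the filter commonly applied to SIFT matches. It assumes descriptors were already extracted by some SIFT implementation and are plain lists of floats; the function name `ratio_test_matches` and the 0.75 threshold are illustrative assumptions, not something the article describes.

```python
import math


def ratio_test_matches(query_desc, corpus_desc, ratio=0.75):
    """Return (query_idx, corpus_idx) pairs that pass Lowe's ratio test.

    A match is kept only when the nearest corpus descriptor is clearly
    closer than the second nearest, which discards ambiguous matches.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        # Distance from this query descriptor to every corpus descriptor.
        dists = sorted((math.dist(q, c), ci) for ci, c in enumerate(corpus_desc))
        if len(dists) < 2:
            continue  # the test needs a second-nearest neighbour
        (best, best_idx), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((qi, best_idx))
    return matches
```

With near-identical illustrations, many descriptors would likely fail this test, so the count of surviving matches per corpus image is the number to watch.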
Alternative solution that would require less heavy lifting of ML but a little more upfront programming:
It sounds like the cars are arranged in a grid on the wall. Maybe it would be possible to narrow down which car the user took a photo of by looking at the photos of the surrounding cars as well, and hardcoding into the system the position of each car relative to one another?
Could potentially do that locally very quickly (maybe even at the level of QR-code speed) versus doing an embedding + LLM.<p>The con of this approach would be that it requires maintenance if they ever decide to change the illustration positions.
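A rough sketch of the neighbour idea above, assuming some matcher already produces a similarity score per illustration for the target crop and for each visible neighbouring crop. The layout `GRID`, the slot names, and the `locate` function are all hypothetical.

```python
# Hypothetical wall layout: illustration IDs as they hang, row by row.
GRID = [
    ["a", "b", "c"],
    ["d", "e", "f"],
]

# Offsets (row, col) of the neighbours we expect to see around the target.
OFFSETS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}


def locate(scores, grid=GRID):
    """Pick the grid cell that best explains the target and its neighbours.

    `scores` maps a slot ("center", "left", ...) to {illustration_id: similarity}
    for the crop seen in that slot. Slots that were not visible can be omitted.
    """
    best_id, best_total = None, float("-inf")
    for r, row in enumerate(grid):
        for c, candidate in enumerate(row):
            total = scores.get("center", {}).get(candidate, 0.0)
            for slot, (dr, dc) in OFFSETS.items():
                nr, nc = r + dr, c + dc
                if slot in scores and 0 <= nr < len(grid) and 0 <= nc < len(grid[nr]):
                    # Add how well the crop in this slot matches the
                    # illustration that hangs there in the hardcoded layout.
                    total += scores[slot].get(grid[nr][nc], 0.0)
            if total > best_total:
                best_id, best_total = candidate, total
    return best_id
```

The hardcoded `GRID` is exactly the maintenance cost mentioned above: if the illustrations move, the table has to be updated.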
This tracks with my experience.
We built a complex processing pipeline for an NLP classification, search, and comprehension task, using a vector database of proprietary data, etc.<p>We ran a benchmark of our system against an LLM call, and the LLM performed much better and was much cheaper in terms of dev time, complexity, and compute.
Incredible time to be working in the space, seeing traditional problems eaten away by new paradigms.
Interesting approach to a very interesting challenge, given how close the images supposedly are.<p>With the limited training data they have, I'm surprised they don't mention any attempts at synthetic training data. Make (or buy) a couple of museum scenes in Blender, hang one of the images there, take images from a lot of angles, and repeat for more scenes, lighting conditions, and all 350 images. Should be easy to script. Then train YOLO on those images, or if that still fails, use their embedding approach with those training images.
First time for me posting this kind of story - I thought it would make an interesting case on solving a hard computer vision problem with a crafty product engineer team.
Very neat explanation of solving these kinds of unique challenges, especially given how similar the illustrations were.<p>One question I had was, knowing how difficult it was to train the model with the base images, and given that the client didn’t have time to photograph them, did you consider flying someone out to the museum for a couple of days to photograph each illustration from several angles with the actual lighting throughout the day? Or potentially hiring a photographer near the museum to do that? It seems like a round trip ticket plus a couple nights in a hotel could have saved a lot of headache, providing more images to turn into synthetic training data. Even if you still had to resort to using 4o as a tiebreaker, it could be that you only present two candidates as the third might have a much lower similarity score to the second candidate.
Good write up either way.
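The point above about presenting only two candidates when the third scores much lower can be sketched as a gap-based cutoff. The `shortlist` name and the `max_gap` value of 0.05 are assumptions for illustration; a real threshold would need tuning on actual similarity scores.

```python
def shortlist(scored, max_candidates=3, max_gap=0.05):
    """Keep top candidates until the similarity drops sharply.

    `scored` is a list of (illustration_id, similarity) pairs. A candidate is
    dropped, along with everything after it, when its score falls more than
    `max_gap` below the previous candidate's score.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)[:max_candidates]
    kept = ranked[:1]
    for prev, cur in zip(ranked, ranked[1:]):
        if prev[1] - cur[1] > max_gap:
            break
        kept.append(cur)
    return [item_id for item_id, _ in kept]
```

For example, scores of 0.91, 0.90 and 0.70 would leave only the first two candidates for the tiebreaker to choose between.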
Huh I think this YouTube short is the same topic: <a href="https://youtube.com/shorts/DA_-6296G5o?si=BLKcSP2Q1jAaca9K" rel="nofollow">https://youtube.com/shorts/DA_-6296G5o?si=BLKcSP2Q1jAaca9K</a><p>Finding new geoglyphs from known examples.
A bit tangential, but I think we will see a good chunk of small teams building competing products in different software business segments, just by doubling down on productivity and offering a cheaper option thanks to lower operational overhead (read: paying engineers). I can think of at least two businesses that could be undercut on cost if the team can automate a good chunk of the work.
A completely different approach that doesn't require heavy AI would be an app on the user's phone that does this:<p>1. Measure the distance from the wall (standard image processing)<p>2. Use the rotations from the phone's gyro sensors to work out which car is being looked at<p>I wonder if this could be as accurate, though.
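The geometry behind those two steps can be sketched as a ray hitting the wall plane. This assumes the phone's position relative to the grid's bottom-left corner is somehow known (`phone_x_m`, `phone_y_m`), which is probably the hard part in practice; all names and the uniform cell size are hypothetical.

```python
import math


def aimed_cell(distance_m, yaw_deg, pitch_deg, phone_x_m, phone_y_m,
               cell_w_m, cell_h_m, n_cols, n_rows):
    """Return (row, col) of the grid cell the phone points at, or None.

    Yaw and pitch are measured from the wall's perpendicular (0, 0 means
    pointing straight at the wall). Rows are counted from the bottom of the
    grid and columns from the left.
    """
    if abs(yaw_deg) >= 90 or abs(pitch_deg) >= 90:
        return None  # pointing away from or parallel to the wall
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    # Intersect the viewing ray with the wall plane at `distance_m`.
    hit_x = phone_x_m + distance_m * math.tan(yaw)
    hit_y = phone_y_m + distance_m * math.tan(pitch) / math.cos(yaw)
    col, row = math.floor(hit_x / cell_w_m), math.floor(hit_y / cell_h_m)
    if 0 <= col < n_cols and 0 <= row < n_rows:
        return row, col
    return None
```

Because the hit point scales with `tan` of the angle, small sensor errors grow quickly at steep angles, which bears on the accuracy question raised above.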
Thanks for the “bitter lesson” news from the frontlines. Curious: did you experiment with 4o as the sole pipeline? And of course, as I think you mention, it would be interesting to know whether, say, Llama 8B could do a similar job as well.<p>Congrats on shipping.
Side question: is there any good model that allows for image similarity detection across a large image set, that can be incrementally augmented with new images?<p>You'd somehow have to generate an embedding for each image, I presume.
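As a minimal sketch of the embedding idea in this question: once each image has an embedding from some model (which model is left open here), an index that can be extended incrementally is just a growing list plus a cosine-similarity scan. `ImageIndex` is a hypothetical name, and the linear scan is an assumption that only suits small sets; a large collection would more likely use an approximate nearest-neighbour library.

```python
import math


class ImageIndex:
    """Brute-force cosine-similarity index that accepts new images at any time."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    @staticmethod
    def _normalize(embedding):
        norm = math.sqrt(sum(x * x for x in embedding))
        if norm == 0:
            raise ValueError("embedding must be non-zero")
        return [x / norm for x in embedding]

    def add(self, image_id, embedding):
        # Store unit-length vectors so a dot product equals cosine similarity.
        self._ids.append(image_id)
        self._vectors.append(self._normalize(embedding))

    def query(self, embedding, k=3):
        """Return the k most similar (image_id, cosine_similarity) pairs."""
        q = self._normalize(embedding)
        scored = [
            (image_id, sum(a * b for a, b in zip(q, vec)))
            for image_id, vec in zip(self._ids, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

Adding an image never requires retraining or rebuilding anything; new entries are searchable on the next `query` call.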
Cool real-life use case.
I don't think LMMs usually get applied where they reasonably should be, and I am glad that a generic kNN model was also used, both to cut costs and because it is simply more suitable.
Reads to me like "conventional AI" did 95% of the work on the problem, and then using an LLM at the end seems to work like a lucky three-sided die.<p>When "embeddings" are used to perform a closeness test, you are using a pretrained computer vision model behind the scenes. It is doing the vast majority of the work by filtering hundreds of images down to a handful.<p>A visual LLM works on textual descriptions, which seem far too close for similar images. Regardless, more power to the team for finding something that works for them.
Calling an LLM and a CV model by the same name to give the appearance of AGI is a pet peeve of mine.<p>And someone that's not OpenAI buying into this naming convention is just unpaid propaganda.
This was a fun read. I’m not an AI expert by any means. I’m also ESL. Please bear with me.<p>The inaccuracy threshold seems fine for a museum, but in enterprise operations inaccuracy can mean lost revenue or, worse, lost trust and future business.<p>I’m struggling with some more advanced AI use cases in my collaborative work platform. I use AI (LLMs) for things like summarization, communication, and finding information using embeddings. However, sometimes it is completely wrong.<p>To test this I spent a few days (doing something unrelated) building up a recipes database and then trying to query it for things like “I want to make a quick and easy drink”. I ran the data through classification and other steps to get the best data I could. The results would still include fries or some other food when I’m asking for drinks.<p>So I have to ask: what the heck am I doing wrong? Again, for things like sending messages and reminders, coming up with descriptions, and finding old messages that match some input, there is no problem.<p>But if I have data that I’m augmenting with additional information (trying to attach information that may be missing but can be deduced from what’s available) to try to enable richer workflows, I’m always getting bitten in the butt. I feel like if I can figure this out I can provide way more value.<p>Not sure if what I said makes sense.