I think the image encoder from CLIP (even the smallest variant, ViT-B/32) captures enough semantic information to allow natural-language queries once images are indexed. A lot of the work actually goes into integrating existing metadata like the local directory and date-time to augment the NL query and re-rank the results.

I work on such a tool[0] for end-to-end indexing of a user's personal photos, and recently added functionality to index Google Photos too!

[0] https://github.com/eagledot/hachi
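For anyone who wants to try that idea, here is a minimal sketch using Hugging Face's CLIP ViT-B/32: embed the images once at index time, then embed a natural-language query and rank by cosine similarity. The file names and the small search helper are placeholders for illustration, not code from hachi.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed_images(paths):
        # Encode photos once at index time; normalise so dot product = cosine similarity.
        images = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    def search(query, paths, image_feats, top_k=5):
        # Encode the natural-language query with the matching text encoder and rank images.
        inputs = processor(text=[query], return_tensors="pt", padding=True)
        with torch.no_grad():
            q = model.get_text_features(**inputs)
        q = q / q.norm(dim=-1, keepdim=True)
        scores = (image_feats @ q.T).squeeze(1)
        best = scores.topk(min(top_k, len(paths))).indices
        return [(paths[i], scores[i].item()) for i in best]

    photos = ["beach.jpg", "birthday.jpg", "hike.jpg"]  # placeholder file names
    feats = embed_images(photos)
    print(search("people hiking in the mountains", photos, feats))

Metadata such as directory or date can then be applied as filters or as a re-ranking signal on top of these similarity scores.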
Immich (a self-hosted Google Photos alternative) has been using CLIP models for smart search for a while, and anecdotally it seems to work really well: it indexes fast and the results are of similar quality to the giant SaaS providers.
Since LLaVA is multimodal, I wonder if there's a chance here to strip out a bit of complexity. Specifically, instead of going through three embeddings (LLaVA internal, text, MiniLM), could you use a not-last layer of LLaVA as your vector? It would probably require a bit of fine-tuning, though.

For pure text, that's kind of how e5-mistral works: https://huggingface.co/intfloat/e5-mistral-7b-instruct

Or yeah, just use CLIP like another commenter suggests...
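As a rough illustration of that "reuse a hidden state as the embedding" idea, here is a hedged sketch of e5-mistral-style last-token pooling, with the layer left selectable. Prompt formatting is simplified, and adapting this to LLaVA would also mean feeding in the image inputs plus, as noted, some fine-tuning.

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "intfloat/e5-mistral-7b-instruct"  # illustrative model choice
    tok = AutoTokenizer.from_pretrained(name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token  # make sure batch padding works
    model = AutoModel.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

    def embed(texts, layer=-1):
        # layer=-1 is the last hidden layer; a "not-last" layer is just layer=-2, -3, ...
        batch = tok(texts, padding=True, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**batch, output_hidden_states=True)
        hidden = out.hidden_states[layer]
        # pool the hidden state of each sequence's final non-padding token
        last_idx = batch["attention_mask"].sum(dim=1) - 1
        vecs = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        return torch.nn.functional.normalize(vecs, dim=-1)

    print(embed(["query: photos of a dog on a beach"]).shape)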
Here is an example that builds a vector index of images using the CLIP model:

https://neuml.hashnode.dev/similarity-search-with-images

This allows queries with both text and images.
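For a feel of what that looks like in practice, here is a short sketch (not the code from the linked post) using the sentence-transformers CLIP wrapper, which puts images and text in the same vector space so either can be the query. The file names are placeholders.

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("clip-ViT-B-32")

    paths = ["cat.jpg", "mountains.jpg", "city_night.jpg"]  # placeholder files
    index = model.encode([Image.open(p) for p in paths], convert_to_tensor=True)

    # query by text ...
    text_hits = util.cos_sim(
        model.encode("a cat sleeping on a sofa", convert_to_tensor=True), index)
    # ... or query by another image
    image_hits = util.cos_sim(
        model.encode(Image.open("query.jpg"), convert_to_tensor=True), index)
    print(text_hits, image_hits)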
Is anyone aware of a model that is trained to give photos a quality rating? I have decades of RAW files sitting on my server that I would love to pass over and tag the ones that are worth developing further. It would be nice to build a shortlist.
For indexing images, it is probably most convenient to directly calculate the embeddings using the CLIP image encoder and retrieve them using the CLIP text encoder.
Nice work. I'm thinking it could be taken even further by incorporating location information, date and time, and even people (facial-recognition) data from the photos, and having an LLM write one "metadata text" for every photo.
This way one can query "person X traveling with Y to Norway about 7 years ago" and quickly get useful results.
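A minimal sketch of that metadata-text idea, under the assumption that `people` and `place` come from separate face-recognition and reverse-geocoding steps (those inputs and the file name are hypothetical):

    from PIL import Image

    def metadata_text(path, people=None, place=None):
        # Compose one descriptive string per photo; an LLM could polish it
        # and an embedding model could index it alongside the image vector.
        exif = Image.open(path).getexif()
        taken = exif.get(306, "an unknown date")  # EXIF tag 306 (0x0132) = DateTime
        parts = [f"Photo taken on {taken}"]
        if place:
            parts.append(f"in {place}")
        if people:
            parts.append("showing " + ", ".join(people))
        return " ".join(parts) + "."

    # e.g. "Photo taken on 2017:06:12 14:03:55 in Norway showing Alice, Bob."
    print(metadata_text("IMG_1234.jpg", people=["Alice", "Bob"], place="Norway"))

Relative-time constraints like "about 7 years ago" are probably better handled as a structured date filter alongside the embedding search than baked into the free text.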
This is pretty awesome, but I'm curious whether it can be used to "enhance" the existing iCloud search, which is great at identifying people in my photos, even kids as they age.

I would not want to lose that functionality.
Slightly related: are there any good photo-management alternatives to Photoprism that leverage more recent AI/ML technologies and provide a GUI for end users?
Is there a state of the art for face matching? I love being able to put in a name and find all the photos they appear in.

I don't even mind doing some training of "are these the same or not".

That's one of the conveniences that means I'm still using Google Photos...
I'm still trying to understand the difference between multimodal models like LLaVA and projects like JARVIS that connect LLMs to other Hugging Face models (including object-detection models) or CLIP. Is a multimodal model doing this under the hood?