I think the image encoder from CLIP (even the smallest variant, ViT-B/32) captures enough semantic information to allow natural-language queries once images are indexed. A lot of the work actually goes into integrating existing metadata like the local directory and date-time to augment the NL query and re-rank the results.

I work on such a tool[0] for end-to-end indexing of a user's personal photos, and recently added functionality to index Google Photos too!

[0] https://github.com/eagledot/hachi
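For anyone who wants to try that idea, here is a minimal sketch using Hugging Face's CLIP ViT-B/32: embed the images once at index time, then embed a natural-language query and rank by cosine similarity. The file names and the small search helper are placeholders for illustration, not code from hachi.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed_images(paths):
        # Encode photos once at index time; normalise so dot product = cosine similarity.
        images = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    def search(query, paths, image_feats, top_k=5):
        # Encode the natural-language query with the matching text encoder and rank images.
        inputs = processor(text=[query], return_tensors="pt", padding=True)
        with torch.no_grad():
            q = model.get_text_features(**inputs)
        q = q / q.norm(dim=-1, keepdim=True)
        scores = (image_feats @ q.T).squeeze(1)
        best = scores.topk(min(top_k, len(paths))).indices
        return [(paths[i], scores[i].item()) for i in best]

    photos = ["beach.jpg", "birthday.jpg", "hike.jpg"]  # placeholder file names
    feats = embed_images(photos)
    print(search("people hiking in the mountains", photos, feats))

Metadata such as directory or date can then be applied as filters or as a re-ranking signal on top of these similarity scores.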
Immich (a self-hosted Google Photos alternative) has been using CLIP models for smart search for a while, and anecdotally it seems to work really well: it indexes fast and the results are of similar quality to the giant SaaS providers.
Since LLaVA is multimodal, I wonder if there's a chance here to strip out a bit of complexity. Specifically, instead of going through three embeddings (LLaVA internal, text, MiniLM), could you use a not-last layer of LLaVA as your vector? It would probably require a bit of fine-tuning, though.

For pure text, that's kind of how e5-mistral works: https://huggingface.co/intfloat/e5-mistral-7b-instruct

Or yeah, just use CLIP like another commenter suggests...
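As a rough illustration of that "reuse a hidden state as the embedding" idea, here is a hedged sketch of e5-mistral-style last-token pooling, with the layer left selectable. Prompt formatting is simplified, and adapting this to LLaVA would also mean feeding in the image inputs plus, as noted, some fine-tuning.

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "intfloat/e5-mistral-7b-instruct"  # illustrative model choice
    tok = AutoTokenizer.from_pretrained(name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token  # make sure batch padding works
    model = AutoModel.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

    def embed(texts, layer=-1):
        # layer=-1 is the last hidden layer; a "not-last" layer is just layer=-2, -3, ...
        batch = tok(texts, padding=True, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**batch, output_hidden_states=True)
        hidden = out.hidden_states[layer]
        # pool the hidden state of each sequence's final non-padding token
        last_idx = batch["attention_mask"].sum(dim=1) - 1
        vecs = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        return torch.nn.functional.normalize(vecs, dim=-1)

    print(embed(["query: photos of a dog on a beach"]).shape)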
Here is an example that builds a vector index of images using the CLIP model:

https://neuml.hashnode.dev/similarity-search-with-images

This allows queries with both text and images.
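For a feel of what that looks like in practice, here is a short sketch (not the code from the linked post) using the sentence-transformers CLIP wrapper, which puts images and text in the same vector space so either can be the query. The file names are placeholders.

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("clip-ViT-B-32")

    paths = ["cat.jpg", "mountains.jpg", "city_night.jpg"]  # placeholder files
    index = model.encode([Image.open(p) for p in paths], convert_to_tensor=True)

    # query by text ...
    text_hits = util.cos_sim(
        model.encode("a cat sleeping on a sofa", convert_to_tensor=True), index)
    # ... or query by another image
    image_hits = util.cos_sim(
        model.encode(Image.open("query.jpg"), convert_to_tensor=True), index)
    print(text_hits, image_hits)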
Is anyone aware of a model that is trained to give photos a quality rating? I have decades of RAW files sitting on my server that I would love to pass over and tag the ones that are worth developing further. It would be nice to build a shortlist.
For indexing images, it is probably most convenient to directly calculate the embeddings using the CLIP image encoder and retrieve them using the CLIP text encoder.
Nice work. I'm thinking it could be taken even further by incorporating location information, date and time, and even people (facial-recognition) data from the photos, and having an LLM write one "metadata text" for every photo.
This way one can query "person X traveling with Y to Norway about 7 years ago" and quickly get useful results.
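A minimal sketch of that metadata-text idea, under the assumption that `people` and `place` come from separate face-recognition and reverse-geocoding steps (those inputs and the file name are hypothetical):

    from PIL import Image

    def metadata_text(path, people=None, place=None):
        # Compose one descriptive string per photo; an LLM could polish it
        # and an embedding model could index it alongside the image vector.
        exif = Image.open(path).getexif()
        taken = exif.get(306, "an unknown date")  # EXIF tag 306 (0x0132) = DateTime
        parts = [f"Photo taken on {taken}"]
        if place:
            parts.append(f"in {place}")
        if people:
            parts.append("showing " + ", ".join(people))
        return " ".join(parts) + "."

    # e.g. "Photo taken on 2017:06:12 14:03:55 in Norway showing Alice, Bob."
    print(metadata_text("IMG_1234.jpg", people=["Alice", "Bob"], place="Norway"))

Relative-time constraints like "about 7 years ago" are probably better handled as a structured date filter alongside the embedding search than baked into the free text.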
This is pretty awesome, but I'm curious whether it can be used to "enhance" the existing iCloud search, which is great at identifying people in my photos, even kids as they age.

I would not want to lose that functionality.
Slightly related: are there any good photo-management alternatives to Photoprism that leverage more recent AI/ML technologies and provide a GUI for end users?
Is there a state of the art for face matching? I love being able to put in a name and find all the photos they appear in.

I don't even mind doing some training of "are these the same or not".

That's one of the conveniences that means I'm still using Google Photos...
I'm still trying to understand the difference between multimodal models like LLaVA and projects like JARVIS that connect LLMs to other Hugging Face models (including object-detection models) or CLIP. Is a multimodal model doing this under the hood?