
Indexing iCloud Photos with AI Using LLaVA and Pgvector

208 points by CSDude over 1 year ago

12 comments

warangal over 1 year ago
I think the image encoder from CLIP (even the smallest variant, ViT-B/32) is good enough to capture a lot of semantic information to allow natural-language queries once images are indexed. A lot of the work actually goes into integrating existing metadata like local directory and date-time to augment the NL query and re-rank the results.

I work on such a tool[0] for end-to-end indexing of a user's personal photos, and recently added functionality to index Google Photos too!

[0] https://github.com/eagledot/hachi
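The re-ranking idea above can be sketched in a few lines. This is a minimal, hypothetical example: the similarity scores, blend weight, and field names are made up, and in a real pipeline the `clip_score` values would come from comparing CLIP embeddings rather than being hard-coded.

```python
from datetime import date

# Hypothetical pre-computed CLIP similarity scores (query vs. each photo),
# plus metadata as it might come from the photo library.
photos = [
    {"path": "IMG_001.jpg", "clip_score": 0.81, "taken": date(2016, 7, 3)},
    {"path": "IMG_002.jpg", "clip_score": 0.78, "taken": date(2023, 1, 9)},
    {"path": "IMG_003.jpg", "clip_score": 0.55, "taken": date(2016, 6, 20)},
]

def rerank(photos, target_year, weight=0.3):
    """Blend semantic similarity with closeness to the year the query asked for."""
    def score(p):
        year_gap = abs(p["taken"].year - target_year)
        # Metadata score decays with distance from the requested year.
        meta = 1.0 / (1.0 + year_gap)
        return (1 - weight) * p["clip_score"] + weight * meta
    return sorted(photos, key=score, reverse=True)

# Query like "beach trip around 2016": semantic scores above, year hint 2016.
ranked = rerank(photos, target_year=2016)
print([p["path"] for p in ranked])
# -> ['IMG_001.jpg', 'IMG_003.jpg', 'IMG_002.jpg']
```

Note how the date hint pulls the semantically weaker 2016 photo above the 2023 one; the `weight` knob controls how much metadata is allowed to override pure semantic similarity.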
jsmith99 over 1 year ago
Immich (a self-hosted Google Photos alternative) has been using CLIP models for smart search for a while, and anecdotally it seems to work really well: it indexes fast and the results are of similar quality to the giant SaaS providers.
viraptor over 1 year ago
Since LLaVA is multimodal, I wonder if there's a chance here to strip out a bit of complexity. Specifically, instead of going through three embeddings (LLaVA internal, text, MiniLM), could you use a not-last layer of LLaVA as your vector? It would probably require a bit of fine-tuning though.

For pure text, that's kind of how e5-mistral works: https://huggingface.co/intfloat/e5-mistral-7b-instruct

Or yeah, just use CLIP like another commenter suggests...
dmezzetti over 1 year ago
Here is an example that builds a vector index of images using the CLIP model:

https://neuml.hashnode.dev/similarity-search-with-images

This allows queries with both text and images.
clord over 1 year ago
Is anyone aware of a model trained to give photos a quality rating? I have decades of RAW files sitting on my server that I would love to pass over and tag the ones worth developing further. It would be nice to build a shortlist.
GaggiX over 1 year ago
For indexing images it is probably most convenient to compute the embeddings directly with the CLIP image encoder and retrieve them with the CLIP text encoder.
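The retrieval side of this approach reduces to nearest-neighbor search in the shared embedding space. A minimal sketch, with stand-in 3-dimensional vectors instead of real CLIP outputs (in practice the image vectors would come from CLIP's image tower and the query vector from its text tower, both ~512-dimensional and normalized):

```python
import math

# Stand-in embeddings; a real index would hold CLIP image-tower vectors,
# e.g. stored in pgvector as in the submission.
image_index = {
    "cat.jpg":   [0.9, 0.1, 0.0],
    "beach.jpg": [0.1, 0.9, 0.2],
    "city.jpg":  [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_embedding, index, k=2):
    # Rank stored image embeddings by cosine similarity to the text embedding.
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Pretend this is the CLIP text encoding of "a photo of a cat".
query = [1.0, 0.0, 0.1]
print(search(query, image_index))
# -> ['cat.jpg', 'beach.jpg']
```

Because CLIP trains both towers into one space, no captioning step is needed at all: text queries and images are directly comparable, which is the simplification this comment is pointing at.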
reacharavindh over 1 year ago
Nice work. I'm thinking it could be tinkered with further by incorporating location, date and time, and even people (facial recognition) data from the photos, and having an LLM write one "metadata text" for every photo. That way one could query "person X traveling with Y to Norway about 7 years ago" and quickly get useful results.
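The "one metadata text per photo" idea can be sketched as a simple flattening step before embedding. Everything here is hypothetical example data: the field names, the sample caption, and the output format are just one plausible choice.

```python
# Hypothetical per-photo metadata, as it might come from EXIF, GPS lookup,
# and face recognition, plus a generated caption (e.g. from LLaVA).
photo = {
    "people": ["Alice", "Bob"],
    "location": "Bergen, Norway",
    "taken": "2017-06-14",
    "caption": "two people hiking near a fjord",
}

def metadata_text(p):
    """Flatten metadata plus the generated caption into one searchable string."""
    parts = []
    if p.get("people"):
        parts.append("People: " + ", ".join(p["people"]))
    if p.get("location"):
        parts.append("Location: " + p["location"])
    if p.get("taken"):
        parts.append("Date: " + p["taken"])
    if p.get("caption"):
        parts.append("Scene: " + p["caption"])
    return ". ".join(parts)

print(metadata_text(photo))
# -> People: Alice, Bob. Location: Bergen, Norway. Date: 2017-06-14. Scene: two people hiking near a fjord
```

This single string is what would then be embedded (e.g. with a MiniLM-class text model) and stored alongside the photo, so a query like "Alice in Norway in 2017" can match on names, place, and date in one vector comparison.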
vladgur over 1 year ago
This is pretty awesome, but I'm curious whether it can be used to "enhance" the existing iCloud search, which is great at identifying people in my photos, even kids as they age.

I would not want to lose that functionality.
diggan over 1 year ago
Slightly related: are there any good photo-management alternatives to PhotoPrism that leverage more recent AI/ML technologies and provide a GUI for end users?
say_it_as_it_is over 1 year ago
I really appreciate itch-scratching posts like these. The life story is as important as the workflow.
voiper1 over 1 year ago
Is there a state of the art for face matching? I love being able to put in a name and find all the photos a person is in.

I don't even mind some training of "are these the same or not".

That's one of the conveniences that means I'm still using Google Photos...
behnamoh over 1 year ago
I'm still trying to understand the difference between multimodal models like LLaVA and projects like JARVIS that connect LLMs to other Hugging Face models (including object-detection models) or CLIP. Is a multimodal model doing this under the hood?