
Indexing iCloud Photos with AI Using LLaVA and Pgvector

208 points, by CSDude, over 1 year ago

12 comments

warangal, over 1 year ago
I think the image encoder from CLIP (even the smallest variant, ViT-B/32) is good enough to capture a lot of semantic information and allow natural-language queries once images are indexed. A lot of the work actually goes into integrating existing metadata, like the local directory and date-time, to augment the NL query and re-rank the results.

I work on such a tool[0] that enables end-to-end indexing of a user's personal photos, and I recently added functionality to index Google Photos too!

[0] https://github.com/eagledot/hachi
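
A minimal sketch of that pipeline, pairing the sentence-transformers CLIP wrapper with pgvector for storage and retrieval (the photos table, database name, and file layout are hypothetical):

```python
import glob

import psycopg2
from pgvector.psycopg2 import register_vector
from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP ViT-B/32 encodes images and text into the same 512-dim space.
model = SentenceTransformer("clip-ViT-B-32")

conn = psycopg2.connect("dbname=photos")  # hypothetical database
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)
cur.execute("CREATE TABLE IF NOT EXISTS photos (path text PRIMARY KEY, embedding vector(512))")

for path in glob.glob("photos/*.jpg"):
    emb = model.encode(Image.open(path))  # numpy array, shape (512,)
    cur.execute(
        "INSERT INTO photos (path, embedding) VALUES (%s, %s) ON CONFLICT DO NOTHING",
        (path, emb),
    )
conn.commit()

# Natural-language query: encode the text with the same model, then
# nearest-neighbour search by cosine distance (pgvector's <=> operator).
q = model.encode("kids playing on a beach")
cur.execute("SELECT path FROM photos ORDER BY embedding <=> %s LIMIT 10", (q,))
print([row[0] for row in cur.fetchall()])
```

Re-ranking by metadata, as the comment describes, would then happen on this candidate list.
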
jsmith99, over 1 year ago
Immich (a self-hosted Google Photos alternative) has been using CLIP models for smart search for a while, and anecdotally it works really well: it indexes fast, and the results are of similar quality to those of the giant SaaS providers.
viraptor, over 1 year ago
Since LLaVA is multimodal, I wonder if there's a chance here to strip out a bit of complexity. Specifically, instead of going through three embeddings (LLaVA-internal, text, MiniLM), could you use a not-last layer of LLaVA as your vector? It would probably require a bit of fine-tuning, though.

For pure text, that's kind of how e5-mistral works: https://huggingface.co/intfloat/e5-mistral-7b-instruct. Or, yeah, just use CLIP like another commenter suggests...
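
The mechanics of "use a not-last layer as the vector" look roughly like this in Hugging Face transformers; shown with a small text model for brevity, since the same pattern applies to a multimodal model's language tower, and as the comment notes the raw states would likely need fine-tuning before they retrieve well:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Small stand-in model; swap in the target model's checkpoint and processor.
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tok("kids playing on a beach", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[-2] is the penultimate layer: (batch, seq_len, hidden).
# Mean-pool over non-padding tokens to get one fixed-size vector.
penultimate = out.hidden_states[-2]
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (penultimate * mask).sum(1) / mask.sum(1)
print(embedding.shape)  # torch.Size([1, 768])
```

(e5-mistral instead takes the last hidden state at the final token, but the extraction step is the same idea.)
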
dmezzetti, over 1 year ago
Here is an example that builds a vector index of images using the CLIP model.

https://neuml.hashnode.dev/similarity-search-with-images

This allows queries with both text and images.
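
The linked post builds its index with txtai; the text-or-image querying it demonstrates falls out of CLIP's shared embedding space, which you can also see with the underlying model directly (a sketch, not the post's code; file names are placeholders):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")
index = model.encode([Image.open(p) for p in ["a.jpg", "b.jpg", "c.jpg"]])

# Text and image queries go through the same encoder, so either
# can be scored against the indexed image vectors.
text_hits = util.semantic_search(model.encode("a red bicycle"), index, top_k=3)
image_hits = util.semantic_search(model.encode(Image.open("query.jpg")), index, top_k=3)
print(text_hits, image_hits)
```
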
clord, over 1 year ago
Is anyone aware of a model that is trained to give photos a quality rating? I have decades of RAW files sitting on my server that I would love to pass over and tag the ones worth developing further. It would be nice to build a shortlist.
GaggiX, over 1 year ago
For indexing images, it's probably most convenient to calculate the embeddings directly with the CLIP image encoder and retrieve them with the CLIP text encoder.
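
That two-tower split looks like this with the standard OpenAI checkpoint in Hugging Face transformers (a sketch; the file name and query are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Index time: only the image tower runs.
image_inputs = processor(images=Image.open("photo.jpg"), return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)

# Query time: only the text tower runs.
text_inputs = processor(text=["a cat sleeping on a couch"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity between the two embeddings ranks the index.
print(torch.nn.functional.cosine_similarity(image_emb, text_emb))
```
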
reacharavindh, over 1 year ago
Nice work. I'm thinking it could be tinkered with even further by incorporating location information, date and time, and even people (facial-recognition) data from the photos, and having an LLM write one "metadata text" for every photo. This way one could query "person X traveling with Y to Norway about 7 years ago" and quickly get useful results.
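
The metadata half of that idea can come straight from EXIF; here is a sketch with Pillow, where the helper name and output format are made up and an LLM (or a plain template, as shown) would turn the fields into the per-photo text:

```python
from PIL import Image
from PIL.ExifTags import GPSTAGS, TAGS

def photo_metadata_text(path: str) -> str:
    """Hypothetical helper: flatten a photo's EXIF into one sentence."""
    exif = Image.open(path).getexif()
    tags = {TAGS.get(k, k): v for k, v in exif.items()}
    taken = tags.get("DateTime", "an unknown date")
    # GPS data lives in a nested IFD; Pillow exposes it via get_ifd(0x8825).
    gps = {GPSTAGS.get(k, k): v for k, v in exif.get_ifd(0x8825).items()}
    lat, lon = gps.get("GPSLatitude"), gps.get("GPSLongitude")
    location = f"near {lat}, {lon}" if lat and lon else "an unknown location"
    return f"Photo taken on {taken} at {location}."

print(photo_metadata_text("photo.jpg"))
```

Embedding that sentence alongside the image vector would make queries like the one above answerable.
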
vladgur, over 1 year ago
This is pretty awesome, but I'm curious whether it can be used to "enhance" the existing iCloud search, which is great at identifying people in my photos, even kids as they age.

I would not want to lose that functionality.
diggan, over 1 year ago
Slightly related: are there any good photo-management alternatives to Photoprism that leverage more recent AI/ML technologies and provide a GUI for end users?
say_it_as_it_is, over 1 year ago
I really appreciate itch-scratching posts like these. The life story is as important as the workflow.
voiper1, over 1 year ago
Is there a state of the art for face matching? I love being able to put in a name and find all the photos that person is in.

I don't even mind some training of "are these the same or not".

That's one of the conveniences that means I'm still using Google Photos...
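
A common open-source baseline is the dlib-backed face_recognition library, which embeds faces as 128-dim vectors and matches by distance (a sketch; file names are placeholders):

```python
import face_recognition

# Each encoding is a 128-dim vector from dlib's face embedding model.
known = face_recognition.face_encodings(
    face_recognition.load_image_file("alice.jpg")
)[0]

group = face_recognition.load_image_file("group_photo.jpg")
for encoding in face_recognition.face_encodings(group):
    distance = face_recognition.face_distance([known], encoding)[0]
    # ~0.6 is the library's default match threshold; lower is stricter.
    print("match" if distance < 0.6 else "no match", round(distance, 3))
```

The "are these the same or not" feedback the comment mentions maps naturally onto tightening or loosening that threshold per person.
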
behnamoh, over 1 year ago
I'm still trying to understand the difference between multimodal models like LLaVA and projects like JARVIS that connect LLMs to other Hugging Face models (including object-detection models) or CLIP. Is a multimodal model doing this under the hood?