科技回声

Methods like collaborative filtering via matrix factorization, Word2vec, Doc2vec, etc.. map large, sparse matrices into a low-dimensional vector space while enforcing similarity constraints. There are extensions for vectorizing various modalities (users, items, documents, audio, images, etc) into one vector-space for similarity search and recommendation ([1], [2], [3]). There is extensive research on approximate nearest-neighbor searches ([4]).For example: it's possible to map users, songs, and artists into a common vector space ([1]). Two users who listen to similar songs have high similarity. Songs are recommended based on vector similarity to users. This pattern extends to many domains as long as there is a way to enforce similarity (likes, co-occurrences, etc.) to "train" the vectors.In my experience, training the vectors is simpler than the engineering to efficiently query them (e.g. "select the 10 nearest neighbors to vector with ID 123"). This becomes expensive for large datasets, and correctly using the approximate nearest neighbor libraries is non-trivial.I can't find any database to insert vectors as they're computed and then run queries against them. It seems often companies build a custom API on top of one of the approximate nearest neighbors libraries. Though the interesting queries seem pretty homogeneous.Any ideas as to why none of the big DB players have an offering for this use-case? Like Algolia, but for vectors instead of text? Any recommendations for such a product?[1] IHeartRadio queries various modalities of data from the same vector space: https://youtu.be/jjO1gOH-BW4?t=5m39s [2] Using a convnet to map new (cold-start) songs into an existing vector space: http://benanne.github.io/2014/08/05/spotify-cnns.html [3] Flickr similarity search: http://code.flickr.net/2017/03/07/introducing-similarity-search-at-flickr/ [4] Benchmarks for approximate nearest neighbor libs: https://github.com/erikbern/ann-benchmarks

2 条评论

PaulHoule超过 7 年前

Hyperdimensional nearest-neighbor search is a tough problem; there are index algorithms such as ball trees that work, but they don't deliver the big wins that b-trees give in 1-d space, quadtrees in 2-d space, etc.In many "as a service" offerings computational costs are not a big deal. For this one it would be, thus making the pricing work right for everybody would be a toughie.

billconan超过 7 年前

I thought about word2vec as a service. I gave up because I think customers could easily cache (pirate) my data.

评论 #15714051 未加载

2 条评论

PaulHoule超过 7 年前

billconan超过 7 年前

I thought about word2vec as a service. I gave up because I think customers could easily cache (pirate) my data.

评论 #15714051 未加载

Ask HN: Vector-space database (as a service)?

2 条评论

Ask HN: Vector-space database (as a service)?

2 条评论