I am currently working on a project that requires me to store and efficiently query large amounts of multidimensional data, and I believe a vector database could provide the perfect solution. However, I am unsure which one would best integrate with ClickHouse.
If any of you have had experience using a vector database in conjunction with ClickHouse, I would be immensely grateful for your recommendations and insights.
You should check out <a href="https://github.com/marqo-ai/marqo">https://github.com/marqo-ai/marqo</a> for an end-to-end vector search database with batteries included.<p>Disclaimer, I'm from the Marqo team.
How large is the data size (the number of vectors, and their dimensions?), what are the type of queries (N nearest neighbors to a target vector according to L2 distance, or something else?), from where the queries are sent (reccomendation system for a user request; internal requests from a ML system), the throughput and the latency requirements (how many queries per second it should serve and how quickly it should answer)?<p>ClickHouse already works good for vector search.<p>For example, if you have one million of vectors of 1024 dimensions, and you search nearest vectors to a vector by brute force search, the query will take 150 ms, which is good for a reccomendation system scenario for e-commerce, food-tech, and similar applications.<p>Example:<p><pre><code> CREATE TABLE vectors (id UInt64, vector Array(Float32)) ENGINE = Memory;
SET max_block_size = 16; -- 64 KB per row
INSERT INTO vectors SELECT number, arrayMap(x -> randNormal(0.0, 1.0, x), range(1024)) FROM numbers_mt(1000000); -- 4 GiB
WITH (SELECT vector FROM vectors LIMIT 1) AS target
SELECT count() FROM vectors WHERE NOT ignore(L2SquaredDistance(vector, target)); -- 0.113
SELECT count() FROM vectors WHERE NOT ignore(L2Norm(vector)); -- 0.110
WITH (SELECT vector FROM vectors LIMIT 1) AS target
SELECT count() FROM vectors WHERE NOT ignore(arraySum((x, y) -> x * y, vector, target)); -- 0.150
WITH (SELECT vector FROM vectors LIMIT 1) AS target
SELECT id, L2SquaredDistance(vector, target) AS distance FROM vectors ORDER BY distance LIMIT 10; -- 0.144</code></pre>
I would recommend <a href="https://weaviate.io/" rel="nofollow">https://weaviate.io/</a>
Disclaimer, I work at Weaviate.<p>It is open source, super fast and really easy to work with.
Plus, it can easily handle huge volumes, we even have it running with a billion objects.
Ha, great question! Coincidentally, at Weaviate, we are thinking about this too! There are a few ways to do this with AirByte (<a href="https://docs.airbyte.com/integrations/destinations/weaviate/">https://docs.airbyte.com/integrations/destinations/weaviate/</a>) or (potentially) with Spark (<a href="https://github.com/weaviate/spark-connector">https://github.com/weaviate/spark-connector</a>). Would love to collaborate on this; feel free to reach out over Slack or so.
You can explore MyScale at <a href="https://myscale.com/" rel="nofollow">https://myscale.com/</a>. It's a SaaS platform built on ClickHouse, offering more sophisticated vector indexing options like HNSW and IVF compared to the open-source version. It also provides a free tier for beta users.<p>Disclaimer: I work for MyScale.
ClickHouse can actually store vectors as tuples or arrays.
It also comes with some handy distance functions<p><a href="https://clickhouse.com/docs/en/sql-reference/functions/distance-functions" rel="nofollow">https://clickhouse.com/docs/en/sql-reference/functions/dista...</a>
I've been using <a href="https://github.com/jdagdelen/hyperDB">https://github.com/jdagdelen/hyperDB</a> and it's been really easy to use. I think Clickhouse support is on the short-term roadmap.
You could check Qdrant, a dedicated Vector Database with advanced features. <a href="https://github.com/qdrant/qdrant">https://github.com/qdrant/qdrant</a>
Disclaimer: I'm from the Qdrant Team.