I'm always happy to see OSS contributions, but I don't quite understand why this model is so remarkable. As the leaderboard suggests, it ranks lower than OpenAI's embeddings, while 14 other contributions rank even higher, many of them with a dimensionality comparable to or lower than 768.<p>The 8k context window is new, but isn't the 512-token limitation a soft limit anyway? I'm pretty sure I can stuff bigger documents into BGE, for example.<p>Furthermore, I think that most (all?) benchmarks on the MTEB leaderboard deal with very small documents, so there is nothing here that validates how well this model does on larger documents. If anything, I'd pick a higher-ranking model, because I put little trust in one that only ranks 17th on small documents. Should I expect it to magically get better when the documents get larger?<p>Plus, you can expect that this model was designed to perform well on the datasets in MTEB, while the OpenAI model probably wasn't.<p>Many have also pointed out that 8k-context embeddings will not be very useful in most situations.<p>When would anyone use this model?
This is great news!<p>It feels like open source is closing the gap with "Open"AI, which is really exciting, and it seems to be closing that gap faster than the closed-source models are advancing. Maybe that's wishful thinking though?
This is great to see. The embedding vector is half the size of text-embedding-ada-002's (768 vs. 1536 dimensions) while providing competitive performance, which will save space in databases and make lookups somewhat faster.<p>For those unaware, if 512 tokens of context is sufficient for your use case, there are already many options that outperform text-embedding-ada-002 on common benchmarks:<p><a href="https://huggingface.co/spaces/mteb/leaderboard" rel="nofollow noreferrer">https://huggingface.co/spaces/mteb/leaderboard</a>
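Quick back-of-the-envelope math on what the smaller vectors buy you (illustrative numbers: float32 storage, one million documents):<p><pre><code>  n_docs, bytes_per_float32 = 1_000_000, 4
  print(n_docs * 768  * bytes_per_float32 / 1e9)  # ~3.1 GB at 768 dims (Jina v2)
  print(n_docs * 1536 * bytes_per_float32 / 1e9)  # ~6.1 GB at 1536 dims (ada-002)
</code></pre>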
What is the use case for an 8k-token embedding? My (somewhat limited) experience with long-context models is that they aren't great for RAG. I get the impression they are optimized for something else, like writing 8k+ tokens rather than synthesizing responses.<p>Isn't the normal way of using embeddings to find relevant text snippets for a RAG prompt? Where is it better to have coarser retrieval?
One thing that is missing from the comparison: OpenAI's model is multilingual.<p>Not only does it support and embed a variety of languages, it also computes nearly the same coordinates for the same semantics in different languages. I.e. if you embed "russia is a terrorist state" and "россия - страна-террорист" (the same sentence in Russian), both of these embeddings will have almost the same coordinates.
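If you want to check that yourself, here's a minimal sketch (assumes the pre-1.0 openai Python client and an API key in the OPENAI_API_KEY environment variable):<p><pre><code>  import numpy as np
  import openai  # reads OPENAI_API_KEY from the environment

  resp = openai.Embedding.create(
      model="text-embedding-ada-002",
      input=["russia is a terrorist state", "россия - страна-террорист"],
  )
  a, b = (np.array(d["embedding"]) for d in resp["data"])
  # ada-002 vectors are unit length, so the dot product is cosine similarity
  print(float(a @ b))  # close to 1.0 if the cross-lingual claim holds
</code></pre>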
Just quantized the models for ONNX usage in e.g. transformers.js and got a 4x reduction in file size:<p>- 𝟐𝟖.𝟓 𝐌𝐁 jina-embeddings-v2-small-en (<a href="https://huggingface.co/do-me/jina-embeddings-v2-small-en" rel="nofollow noreferrer">https://huggingface.co/do-me/jina-embeddings-v2-small-en</a>)<p>- 𝟏𝟎𝟗 𝐌𝐁 jina-embeddings-v2-base-en (<a href="https://huggingface.co/do-me/jina-embeddings-v2-base-en" rel="nofollow noreferrer">https://huggingface.co/do-me/jina-embeddings-v2-base-en</a>)<p>However, I noticed that the base model performs quite poorly on small text chunks (a few words), while the small version seems unaffected. Might this be a side effect of the way they deal with large contexts?<p>If you want to test this, head over to SemanticFinder (<a href="https://do-me.github.io/SemanticFinder/" rel="nofollow noreferrer">https://do-me.github.io/SemanticFinder/</a>), go to advanced settings, choose the Jina AI base model (at the very bottom) and run with "Find". You'll see that all the other models find "food"-related chunks just fine, but the base version doesn't.
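For reference, the ~4x reduction comes from dynamic int8 quantization (float32 weights stored as int8). A minimal sketch with onnxruntime, assuming you already have an ONNX export of the model (this mirrors, but isn't necessarily, the exact script used for the files above):<p><pre><code>  from onnxruntime.quantization import QuantType, quantize_dynamic

  # Weights go from float32 to int8; activations stay float at runtime
  quantize_dynamic(
      model_input="model.onnx",
      model_output="model_quantized.onnx",
      weight_type=QuantType.QInt8,
  )
</code></pre>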
I just shipped a new llm-embed-jina plugin for my LLM tool which provides access to these new Jina models: <a href="https://github.com/simonw/llm-embed-jina">https://github.com/simonw/llm-embed-jina</a><p>Here's how to try it out.<p>First, install LLM. Use pip or pipx or brew:<p><pre><code> brew install llm
</code></pre>
Next install the new plugin:<p><pre><code> llm install llm-embed-jina
</code></pre>
You can confirm the new models are now available to LLM by running:<p><pre><code> llm embed-models
</code></pre>
You should see a list that includes "jina-embeddings-v2-small-en" and "jina-embeddings-v2-base-en".<p>To embed a string using the small model, run this:<p><pre><code> llm embed -m jina-embeddings-v2-small-en -c 'Hello world'
</code></pre>
That will output a JSON array of 512 floating point numbers (see my explainer here for what those are: <a href="https://simonwillison.net/2023/Oct/23/embeddings/#what-are-embeddings" rel="nofollow noreferrer">https://simonwillison.net/2023/Oct/23/embeddings/#what-are-e...</a>)<p>Embeddings are only really interesting if you store them and use them for comparisons.<p>Here's how to use the "llm embed-multi" command to create embeddings for the 30 most recent issues in my LLM GitHub repository:<p><pre><code> curl 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
| jq '[.[] | {id: .id, title: .title}]' \
| llm embed-multi -m jina-embeddings-v2-small-en jina-llm-issues - \
--store
</code></pre>
This creates a collection called "jina-llm-issues" in a default SQLite database on your machine (the path to that can be found using "llm collections path").<p>To search for issues in that collection with titles most similar to the term "bug":<p><pre><code> llm similar jina-llm-issues -c 'bug'
</code></pre>
Or for issues most similar to another existing issue by ID:<p><pre><code> llm similar jina-llm-issues 1922688957
</code></pre>
Full documentation on what you can do with LLM and embeddings here: <a href="https://llm.datasette.io/en/stable/embeddings/index.html" rel="nofollow noreferrer">https://llm.datasette.io/en/stable/embeddings/index.html</a><p>Alternative recipe - this creates embeddings for every single README.md in the current directory and its subdirectories. Run this somewhere with a node_modules folder and you should get a whole lot of interesting stuff:<p><pre><code> llm embed-multi jina-readmes \
-m jina-embeddings-v2-small-en \
--files . '**/README.md' --store
</code></pre>
Then search them like this:<p><pre><code> llm similar jina-readmes -c 'backup tools'</code></pre>
Impressive work.<p>I wonder what would be the best way to use 8k embeddings. It’s a lot of information to keep in a vector, so things like “precision” of the embedding space and its ability to distinguish very similar large documents will be key.<p>Maybe it can be useful for coarse similarity matching, for example to detect plagiarism?
When I go to this leaderboard:
<a href="https://huggingface.co/spaces/mteb/leaderboard" rel="nofollow noreferrer">https://huggingface.co/spaces/mteb/leaderboard</a>
I click on the "Classification" tab, and I see "jina-embeddings-v2-base-en" at number 12 with an average score of 73.45. The highest-scoring model there is llmrails/ember-v1 with a 75.99 average score, but it only supports 512 tokens, so if you need 8K tokens embedded, I guess Jina's is the best option. Do people need 8K tokens for embedding? Maybe not, but they might need more than 512 often enough. It could save a summary-extraction step.
Just noticed that they (jina.ai) have offices in both Berlin and China.
I am wondering how they will operate given the chip export restrictions and other side effects of US/China tensions.
Jina AI itself also makes a great framework for exposing deep neural net models as APIs and deploying them to Kubernetes clusters, which I think is very promising, but it never got as much hype as I thought it deserved.
I wonder how much better this is compared to taking the average (or some other aggregation) of embeddings from a model with a smaller context length. Has anyone done such a comparison?
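In case anyone wants to run that comparison, the aggregation baseline is easy to set up. A sketch of chunk-then-mean-pool, using BGE purely as an example 512-token model:<p><pre><code>  import numpy as np
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 512-token context

  def embed_by_averaging(text, chunk_chars=1500):
      # Naive fixed-width character chunks; a real test would split on tokens
      chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
      vectors = model.encode(chunks, normalize_embeddings=True)
      mean = vectors.mean(axis=0)
      return mean / np.linalg.norm(mean)  # re-normalize the pooled vector
</code></pre>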
This is super cool! I wish there were an easy-to-understand-and-follow guide on how to make your own embeddings, for Llama 2 for example. All I can find are various guides that already assume you know everything there is to know about training embeddings.<p>I just want to make an embedding of a conversation between me and my friend and simulate talking to them. Is this a hard thing to train, to begin with?<p>If anyone knows or could help me with this, I would be very grateful!
Pardon my ignorance in advance, but could it be used to "chat" with PDFs and websites? I am looking for OpenAI alternatives as I am in the learning phase.
Color me surprised! It looks like it's actually open source (Apache 2.0) and not the usual false advertising by some two-faced company or institution. Links here:<p>* <a href="https://huggingface.co/jinaai/jina-embeddings-v2-base-en" rel="nofollow noreferrer">https://huggingface.co/jinaai/jina-embeddings-v2-base-en</a>
* <a href="https://huggingface.co/jinaai/jina-embeddings-v2-small-en" rel="nofollow noreferrer">https://huggingface.co/jinaai/jina-embeddings-v2-small-en</a>
Some relevant stats from the link:<p>8192 token input sequence length<p>768 embedding dimensions<p>0.27GB model (with 0.07GB model also available)<p>Tokeniser: BertTokenizer [1], 30528 token vocab [2]<p>Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.<p>[1] <a href="https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blob/main/tokenizer_config.json" rel="nofollow noreferrer">https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...</a><p>[2] <a href="https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blob/main/vocab.txt" rel="nofollow noreferrer">https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...</a>
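One way to put a rough number on that difference is to tokenize the same text with both tokenisers and compare counts (a sketch; assumes transformers and tiktoken are installed, and that ada-002 uses the cl100k_base encoding; the input file is just a placeholder):<p><pre><code>  import tiktoken
  from transformers import AutoTokenizer

  text = open("some_long_document.txt").read()  # placeholder input

  jina_tok = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en")
  ada_tok = tiktoken.get_encoding("cl100k_base")  # encoding used by ada-002

  # A smaller vocab generally means more tokens for the same text, so 8192
  # Jina tokens cover somewhat less raw text than 8192 ada-002 tokens
  print(len(jina_tok.encode(text)), len(ada_tok.encode(text)))
</code></pre>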