
Latent Dictionary: 3D map of Oxford3000+search words via DistilBert embeddings

89 points by pps over 1 year ago

13 comments

minimaxir over 1 year ago
Some notes on how embeddings/DistilBERT embeddings work, since the other comments are confused:

1) There are two primary ways to have models generate embeddings: implicitly from an LLM by mean-pooling its last hidden state, since it has to learn how to map text into a distinct latent space anyway to work correctly (i.e. DistilBERT), or you can use a model which generates embeddings directly and is trained with something like triplet loss to explicitly incentivise learning similarity/dissimilarity. Popular text-embedding models like BAAI/bge-large-en-v1.5 tend to use the latter approach.

2) The famous word2vec examples of e.g. woman + king = queen only work because word2vec is a shallow network and the model learns the word embeddings directly, instead of them being emergent. The latent space still maps them closely, as shown with this demo, but there isn't any algebraic intuition. You can get close with algebra, but no cigar.

3) DistilBERT is pretty old (2019) and based on a 2018 model trained on Wikipedia and books, so there will be significant text drift, in addition to it being less robust than newer modeling techniques trained on more robust datasets. I do not recommend using it for production applications nowadays.

4) There is an under-discussed opportunity for dimensionality reduction techniques like PCA (which this demo uses to get the data into 3D) to both improve signal-to-noise and improve distinctiveness. I am working on a blog post about a new technique to handle dimensionality reduction for text embeddings better, which may have interesting and profound usability implications.
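To make points 1 and 4 concrete, here is a minimal illustrative sketch (not the Latent Dictionary source code) of the mean-pooling approach described above: it pools DistilBERT's last hidden state into one vector per word and then projects the vectors to 3D with PCA. It assumes the Hugging Face distilbert-base-uncased checkpoint plus torch, transformers, and scikit-learn.

```python
# Sketch only: mean-pool DistilBERT's last hidden state per word, then PCA to 3D.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA

words = ["dog", "electric", "life", "human", "king", "queen"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer(words, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (batch, tokens, 768)

# Average over real tokens only, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, tokens, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Project the 768-dimensional vectors down to 3 coordinates for plotting.
coords = PCA(n_components=3).fit_transform(embeddings.numpy())
print(dict(zip(words, coords.round(2).tolist())))
```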
tikimcfee over 1 year ago
Edit: I think this is fascinating. If you use words like dog, electric, life, and human, all of them appear in one mass, whereas words like greet, chicken, and "a" appear in a section with a different density. I think it's interesting that the words have diverged in location, with some apparent relationship to the way the words are used. If this were truly random, I would expect those words to be mixed in with the other ones.

I have this, except you can see every single word in any dictionary at once in space; it renders individual glyphs. It can show an entire dictionary of words - definitions and roots - and let you fly around in them. It's fun. I built a sample that "plays" a sentence and its definitions: GitHub.com/tikimcfee/LookAtThat. The more I see stuff like this, the more I want to complete it. It's heartening to see so many people fascinated with seeing words... I just wish I knew where to find these people to, like, befriend and get better. I'm getting the feeling I just kind of exist between worlds of lofty ideas and people that are incredibly smart sticking around other people that are incredibly smart.
wrsh07 over 1 year ago
I wish there were more context and maybe the ability to do math on the vectors.

E.g. what is the real distance between two vectors? That should be easy to compute.

Similarly: what do I get from summing two vectors, and what are some nearby vectors?

Or maybe just generally: what are some nearby vectors?

Without any additional context it's just a point cloud with a couple of randomly labeled elements.
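For what it's worth, the distance and vector-sum queries asked for here are a few lines of NumPy. A minimal sketch using toy stand-in data (in practice the vectors would come from a model, e.g. the DistilBERT sketch above); cosine_similarity and nearest are hypothetical helpers introduced just for illustration.

```python
import numpy as np

# Toy stand-in for the real data: N words with 768-dimensional embeddings.
rng = np.random.default_rng(0)
words = ["king", "queen", "dog", "electric", "life", "human"]
embeddings = rng.normal(size=(len(words), 768))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query, embeddings, words, k=5):
    # Rank every vocabulary word by cosine similarity to the query vector.
    scores = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    )
    return [(words[i], round(float(scores[i]), 3)) for i in np.argsort(-scores)[:k]]

king, queen = embeddings[words.index("king")], embeddings[words.index("queen")]
print("distance:", 1 - cosine_similarity(king, queen))      # distance between two vectors
print("near king+queen:", nearest(king + queen, embeddings, words))  # neighbours of a sum
```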
granawkins over 1 year ago
Hey guys, I'm the bored SOB who built this. Thanks for the awesome discussion; a lot of you know more about this than I do!

I hadn't planned to keep building this, but if I do, what should I add/change?
chaxor over 1 year ago
Typically these kinds of single-word embedding visualizations work much better with non-contextualized models, such as the more traditional gensim or word2vec approaches, because contextual encoder-based embedding models like BERT don't 'bake in' as much to the token (word) itself and instead rely on its context to define it. Also, with contextual models like BERT, PCA often ends up with $PC_0$ aligned with the length of the document.
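As a point of comparison, analogy and similarity queries on those non-contextualized vectors are easy to try; a minimal sketch, assuming gensim is installed and using its pretrained GloVe vectors (roughly a 130 MB download on first use):

```python
import gensim.downloader as api

# Load a small set of pretrained, non-contextualized word vectors.
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman: the classic analogy, learned directly by shallow models.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Single-word similarity works without any surrounding context.
print(vectors.most_similar("dictionary", topn=5))
```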
kvakkefly over 1 year ago
Running the same words multiple times, I get a different visualization each time. I don't really understand what's going on, but I like the idea of visualizing embeddings.
thom over 1 year ago
Seems mostly nonsensical; not sure if that's a bug or some deeper point I'm missing.
pamelafox over 1 year ago
I’m looking for more resources like this that attempt to visually explain vectors, as I’ll be giving some talks around vector search. Does anyone have related suggestions?
tetris11 over 1 year ago
Interesting that "cromulent" and "hentai" seem to map right next to each other, as well as the words "decorate" and "spare".
eurekin over 1 year ago
I added these in succession:

> man woman king queen ruler force powerful care

and couldn't reliably determine the position of any of them.
smrtinsert over 1 year ago
I would love a quickest-path feature between two words, for example between color and colour.
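One way such a feature could work, as a hypothetical sketch with toy data: walk the straight line between the two word vectors and snap each step to its nearest vocabulary word. The word_path and nearest_word helpers and the vocabulary below are made up for illustration; real vectors would come from the model.

```python
import numpy as np

# Toy vocabulary and embeddings; in practice both spellings would get real vectors.
rng = np.random.default_rng(0)
words = ["color", "colour", "shade", "hue", "paint", "dog"]
embeddings = rng.normal(size=(len(words), 768))

def nearest_word(point, embeddings, words):
    # Vocabulary word whose vector has the highest cosine similarity to `point`.
    scores = embeddings @ point / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(point)
    )
    return words[int(np.argmax(scores))]

def word_path(a, b, steps=5):
    # Interpolate between the two vectors and label each step with its nearest word.
    va, vb = embeddings[words.index(a)], embeddings[words.index(b)]
    return [nearest_word((1 - t) * va + t * vb, embeddings, words)
            for t in np.linspace(0.0, 1.0, steps)]

print(word_path("color", "colour"))
```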
larodi over 1 year ago
Is this with some sort of dimensionality reduction of the embedding space?
cuttysnark over 1 year ago
Edge of the galaxy: 'if when that then wherever where while for'