
Fast word vectors with little memory usage in Python

111 points by jinqueeny, over 6 years ago

7 comments

patelajay285, over 6 years ago
This is interesting work. We at Plasticity (YC S17) open sourced something similar called Magnitude (https://github.com/plasticityai/magnitude) a few months ago for quickly querying vector embeddings with low memory usage, using SQLite and a standard universal file format (.magnitude) across word2vec, fastText, and GloVe. We also added features for out-of-vocabulary lookups, bulk queries, concatenating embeddings from multiple models, etc. It may also be of interest to folks looking at this.

We also have a paper on Magnitude that we will be presenting at EMNLP 2018, and ELMo support is coming soon!
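A minimal sketch of what a Magnitude lookup looks like with the pymagnitude package, assuming a pre-converted .magnitude file (the file name below is a placeholder):

    from pymagnitude import Magnitude

    # Vectors are read lazily from the SQLite-backed .magnitude file,
    # so opening it is fast and memory usage stays low.
    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

    print(vectors.dim)                      # embedding dimensionality
    print(vectors.query("king")[:5])        # single-word lookup
    print(vectors.query(["fast", "word"]))  # bulk query returns a matrix
    print(vectors.most_similar("python", topn=3))
    # Out-of-vocabulary words still return a vector built from character n-grams:
    print(vectors.query("pythonnn")[:5])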
danieldk, over 6 years ago
Note that the original word2vec binary format is extremely bad for low-memory use. It stores the words and vectors interleaved (and words are obviously of variable length). Of course, you could circumvent this problem by building a separate (sorted) index file.

However, newer formats, such as the fastText binary format, store the embedding matrix contiguously. In such formats you can just memory-map the embedding matrix and only load the vocabulary into memory. This is even simpler than the approach described here and in the Delft README: you don't have any serialization/deserialization overhead [1], you can let the OS decide how much to cache in memory, and you have one dependency fewer (mmap is in POSIX [2]).

[1] Of course, if you have a system with different endianness, you have to do byte swapping.

[2] http://pubs.opengroup.org/onlinepubs/7908799/xsh/mmap.html
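A minimal sketch of the memory-mapping idea described above, assuming the embedding matrix has already been exported as a raw, contiguous float32 file plus a plain-text vocabulary (this is not fastText's actual binary layout; file names and dimensionality are placeholders):

    import numpy as np

    DIM = 300  # assumed embedding dimensionality

    # Only the vocabulary is loaded into memory.
    with open("vocab.txt", encoding="utf-8") as f:
        vocab = {word.rstrip("\n"): i for i, word in enumerate(f)}

    # The matrix itself is memory-mapped; the OS pages rows in on demand,
    # so there is no deserialization step and "loading" is near-instant.
    matrix = np.memmap("vectors.f32", dtype=np.float32, mode="r")
    matrix = matrix.reshape(len(vocab), DIM)

    def vector(word):
        return matrix[vocab[word]]

    print(vector("cat")[:5])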
visarga, over 6 years ago
You can also speed up the loading of embeddings by using BPE (byte pair encoding) to segment words into a smaller dictionary of character n-grams, and learning n-gram embeddings instead of word embeddings.

You can replace a list of 500K words with 50K n-grams, and it also works on unseen words and on agglutinative languages such as German. It's interesting that it can both join frequent words together and split infrequent words into pieces, depending on the distribution of characters. Another advantage is that the n-gram embedding table is much smaller, making it easy to deploy on resource-constrained systems such as mobile phones.

Neural Machine Translation of Rare Words with Subword Units
https://arxiv.org/abs/1508.07909

A Python library for BPE n-grams: sentencepiece
https://github.com/google/sentencepiece
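A minimal sketch of the BPE approach using the sentencepiece Python bindings (keyword-argument API of recent releases; the corpus file, vocabulary size, and example output are placeholders):

    import sentencepiece as spm

    # Train a 50K-piece BPE model on a raw text corpus.
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="bpe",
        vocab_size=50000, model_type="bpe",
    )

    # Encode text into subword pieces; unseen words decompose into known pieces.
    sp = spm.SentencePieceProcessor(model_file="bpe.model")
    print(sp.encode("unbelievably fast word vectors", out_type=str))
    # e.g. ['▁un', 'believ', 'ably', '▁fast', '▁word', '▁vectors']

You would then train one embedding per piece (50K rows) instead of one per word (500K+ rows), which is what shrinks the table.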
spennihana, over 6 years ago
I almost went with a mem-mapped approach for a version I wrote in Java that uses fork-join: https://github.com/spennihana/FasterWordEmbeddings

Edit: this work uses a different byte layout plus a parallel reader that heaves the word vecs into memory as compressed byte arrays. Load time is seconds (haven't benchmarked with current SSDs). Memory footprint is on the order of the size of your word vecs (memory is cheap for me, but this could easily be extended to support mem-mapping if memory resources are scarce).
infocollector, over 6 years ago
Could you please add a license to the source code? Thanks.
atrudeau, over 6 years ago
Very nice work! :) Are there any benchmarks available? I'm curious how this compares to caching frequent word vectors (Zipf's law helps here) and disk-seeking the rest.
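A minimal sketch of that cache-plus-seek strategy, assuming the same raw float32 matrix and vocabulary files as in the memory-mapping sketch above (all names are placeholders, not part of the linked project):

    from functools import lru_cache
    import numpy as np

    DIM = 300
    RECORD = DIM * 4  # bytes per float32 vector

    with open("vocab.txt", encoding="utf-8") as f:
        row = {w.rstrip("\n"): i for i, w in enumerate(f)}
    data = open("vectors.f32", "rb")

    @lru_cache(maxsize=100_000)  # Zipf's law: a small cache absorbs most lookups
    def vector(word):
        data.seek(row[word] * RECORD)
        return np.frombuffer(data.read(RECORD), dtype=np.float32)

    print(vector("the")[:5])      # frequent word: cached after the first read
    print(vector("zyzzyva")[:5])  # rare word: one disk seek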
guybedo, over 6 years ago
So far I've been using RocksDB for this use case. Is there any benchmark comparing the two?
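A minimal sketch of the RocksDB variant, assuming the python-rocksdb package; the key/value layout (word bytes mapped to raw float32 bytes) is just one obvious choice, not taken from the linked project:

    import numpy as np
    import rocksdb  # python-rocksdb package

    db = rocksdb.DB("embeddings.db", rocksdb.Options(create_if_missing=True))

    # Store each word's vector as raw float32 bytes under its UTF-8 key.
    db.put("cat".encode("utf-8"),
           np.random.rand(300).astype(np.float32).tobytes())

    # Lookup is a single key fetch; RAM use is whatever RocksDB's block cache holds.
    vec = np.frombuffer(db.get(b"cat"), dtype=np.float32)
    print(vec[:5])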