TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast & versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).<p>Main features:
- Encode 1GB in 20sec
- Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
- Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and node.js<p>Github repository and doc: <a href="https://github.com/huggingface/tokenizers/tree/master/tokenizers" rel="nofollow">https://github.com/huggingface/tokenizers/tree/master/tokeni...</a><p>To install:
- Rust: <a href="https://crates.io/crates/tokenizers" rel="nofollow">https://crates.io/crates/tokenizers</a>
- Python: pip install tokenizers
- Node: npm install tokenizers
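For anyone curious what usage looks like, here is a minimal sketch with the Python bindings (corpus.txt is a placeholder file path; check the repo docs for the exact API and options):

    from tokenizers import ByteLevelBPETokenizer

    # Train a byte-level BPE tokenizer on a plain-text corpus
    # ("corpus.txt" is a placeholder path).
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(files=["corpus.txt"], vocab_size=30000, min_frequency=2)

    # The returned Encoding exposes the exhaustive outputs mentioned above.
    encoding = tokenizer.encode("Hello, world!")
    print(encoding.tokens)               # subword strings
    print(encoding.ids)                  # vocabulary ids
    print(encoding.offsets)              # character offset mapping per token
    print(encoding.attention_mask)       # attention mask
    print(encoding.special_tokens_mask)  # marks added special tokens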
I love the work done and made freely available by both spaCy and Hugging Face.<p>I built my own NLP libraries for about 20 years: the simple ones were examples in my books, and the more complex (and less understandable) ones I sold as products and used to pull in lots of consulting work.<p>I have completely given up developing my own NLP tools, and generally I use the Python bindings (via the Hy language (hylang), a Lisp that sits on top of Python) for spaCy, Hugging Face, TensorFlow, and Keras. I am retired now, but my personal research is in hybrid symbolic and deep learning AI.
I can't believe the level of productivity this Hugging Face team has.<p>They seem to have found the ideal balance of software engineering capability and neural network knowledge in a team of highly effective and efficient employees.<p>I don't know what their monetization plan is as a startup, but it is 100% undervalued at $20 million, and that's just on the quality of the team. Now, if only I could figure out how to put a few thousand dollars into a Series A startup as just some guy.
We use both spaCy and Hugging Face at work. Is there a comparison of this vs. spaCy's tokenizer [1]?<p>1. <a href="https://spacy.io/usage/linguistic-features#tokenization" rel="nofollow">https://spacy.io/usage/linguistic-features#tokenization</a>
It used to be that pre-deep-learning tokenizers would extract n-grams (n-token-sized chunks), but this doesn't seem to exist anymore in the word-embedding tokenizers I've come across.<p>Is this possible using Hugging Face (or another word-embedding-based library)?<p>I know there are some simple heuristics, like merging noun token sequences together to extract n-grams, but they are too simplistic and very error-prone.
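For what it's worth, here is a minimal sketch of the "merge noun token sequences" heuristic mentioned above, using spaCy's noun_chunks (assumes spaCy with the en_core_web_sm model installed; this is not part of the new tokenizers library):

    import spacy

    # Crude phrase/n-gram extraction: treat each contiguous noun phrase as
    # one multi-token unit. This is the simplistic heuristic described above.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Subword tokenizers replaced classic n-gram extraction pipelines.")

    ngrams = [chunk.text for chunk in doc.noun_chunks]
    print(ngrams)  # e.g. ['Subword tokenizers', 'classic n-gram extraction pipelines']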
Somewhat related: if someone wants to build something awesome, I haven't seen anything that merges Lucene with BPE/SentencePiece.<p>SentencePiece should make it possible to shrink the memory requirements of your indexes for search and typeahead.
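A rough sketch of the idea (assuming the sentencepiece Python package and an already-trained model saved as "spm.model", which is a placeholder name): tokenize documents into subword pieces before handing terms to the index, so the term dictionary stays bounded by the subword vocabulary instead of growing with every distinct word.

    import sentencepiece as spm

    # Load a trained SentencePiece model ("spm.model" is a placeholder path).
    sp = spm.SentencePieceProcessor(model_file="spm.model")

    # Tokenize a document into subword pieces before indexing. Every term
    # sent to the index comes from a fixed, bounded subword vocabulary,
    # which is where the memory savings would come from.
    doc = "typeahead suggestions for unseen compound words"
    pieces = sp.encode(doc, out_type=str)
    print(pieces)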
Are there examples of how this can be used for topic modeling, document similarity, etc.? All the examples I've seen (gensim) use bag-of-words, which seems outdated.
I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.<p>What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?<p>What cool problems are there?
I didn't realize that particular emoji had a name. I thought it was a play on this: <a href="https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franchise)#Facehugger" rel="nofollow">https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc...</a>