TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast & versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).<p>Main features:
- Encode 1GB in 20sec
- Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
- Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and node.js<p>Github repository and doc: <a href="https://github.com/huggingface/tokenizers/tree/master/tokenizers" rel="nofollow">https://github.com/huggingface/tokenizers/tree/master/tokeni...</a><p>To install:
- Rust: <a href="https://crates.io/crates/tokenizers" rel="nofollow">https://crates.io/crates/tokenizers</a>
- Python: pip install tokenizers
- Node: npm install tokenizers
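For anyone curious what usage looks like, here is a minimal sketch with the Python bindings (corpus.txt is a placeholder file path; check the repo docs for the exact API and options):

    from tokenizers import ByteLevelBPETokenizer

    # Train a byte-level BPE tokenizer on a plain-text corpus
    # ("corpus.txt" is a placeholder path).
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(files=["corpus.txt"], vocab_size=30000, min_frequency=2)

    # The returned Encoding exposes the exhaustive outputs mentioned above.
    encoding = tokenizer.encode("Hello, world!")
    print(encoding.tokens)               # subword strings
    print(encoding.ids)                  # vocabulary ids
    print(encoding.offsets)              # character offset mapping per token
    print(encoding.attention_mask)       # attention mask
    print(encoding.special_tokens_mask)  # marks added special tokens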
I love the work done and made freely available by both spaCy and Hugging Face.<p>I built my own NLP libraries for about 20 years: the simple ones were examples in my books, and the more complex (and less understandable) ones I sold as products and used to pull in lots of consulting work.<p>I have completely given up developing my own NLP tools, and generally I use the Python bindings (via the Hy language (hylang), a Lisp that sits on top of Python) for spaCy, Hugging Face, TensorFlow, and Keras. I am retired now, but my personal research is in hybrid symbolic and deep learning AI.
I can't believe the level of productivity this Hugging Face team has.<p>They seem to have found the ideal balance of software engineering capability and neural network knowledge in a team of highly effective and efficient employees.<p>I don't know what their monetization plan is as a startup, but it is 100% undervalued at $20 million, and that's just on the quality of the team. Now, if only I could figure out how to put a few thousand dollars into a Series A startup as just some guy.
We use both spaCy and Hugging Face at work. Is there a comparison of this vs. spaCy's tokenizer [1]?<p>1. <a href="https://spacy.io/usage/linguistic-features#tokenization" rel="nofollow">https://spacy.io/usage/linguistic-features#tokenization</a>
It used to be that pre-deep-learning tokenizers would extract n-grams (n-token-sized chunks), but this doesn't seem to exist anymore in the word-embedding tokenizers I've come across.<p>Is this possible using Hugging Face (or another word-embedding-based library)?<p>I know there are some simple heuristics, like merging noun token sequences together to extract n-grams, but they are too simplistic and very error-prone.
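For what it's worth, here is a minimal sketch of the "merge noun token sequences" heuristic mentioned above, using spaCy's noun_chunks (assumes spaCy with the en_core_web_sm model installed; this is not part of the new tokenizers library):

    import spacy

    # Crude phrase/n-gram extraction: treat each contiguous noun phrase as
    # one multi-token unit. This is the simplistic heuristic described above.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Subword tokenizers replaced classic n-gram extraction pipelines.")

    ngrams = [chunk.text for chunk in doc.noun_chunks]
    print(ngrams)  # e.g. ['Subword tokenizers', 'classic n-gram extraction pipelines']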
Somewhat related: if someone wants to build something awesome, I haven't seen anything that merges Lucene with BPE/SentencePiece.<p>SentencePiece should make it possible to shrink the memory requirements of your indexes for search and typeahead.
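A rough sketch of the idea (assuming the sentencepiece Python package and an already-trained model saved as "spm.model", which is a placeholder name): tokenize documents into subword pieces before handing terms to the index, so the term dictionary stays bounded by the subword vocabulary instead of growing with every distinct word.

    import sentencepiece as spm

    # Load a trained SentencePiece model ("spm.model" is a placeholder path).
    sp = spm.SentencePieceProcessor(model_file="spm.model")

    # Tokenize a document into subword pieces before indexing. Every term
    # sent to the index comes from a fixed, bounded subword vocabulary,
    # which is where the memory savings would come from.
    doc = "typeahead suggestions for unseen compound words"
    pieces = sp.encode(doc, out_type=str)
    print(pieces)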
Are there examples of how this can be used for topic modeling, document similarity, etc.? All the examples I've seen (gensim) use bag-of-words, which seems outdated.
I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.<p>What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?<p>What cool problems are there?
I didn't realize that particular emoji had a name. I thought it was a play on this: <a href="https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franchise)#Facehugger" rel="nofollow">https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc...</a>