科技回声

14 comments

julien_c, over 5 years ago

TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast and versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).

Main features:
- Encode 1GB in 20sec
- Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
- Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and node.js

GitHub repository and doc: https://github.com/huggingface/tokenizers/tree/master/tokenizers

To install:
- Rust: https://crates.io/crates/tokenizers
- Python: pip install tokenizers
- Node: npm install tokenizers
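For readers wondering what "converting strings into model input tensors" involves, here is a minimal pure-Python sketch of the idea. It is not the library's API or implementation: the vocabulary, the `encode` function and its `max_len` parameter are hypothetical, and the splitting rule is a simple greedy longest-match in the style of WordPiece. It assumes ASCII input, so lowercasing does not shift character offsets.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer learns tens of thousands of entries.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "token": 2, "##ization": 3,
         "is": 4, "fast": 5, "##er": 6}


def encode(text, max_len=8):
    """Greedy longest-match-first (WordPiece-style) encoding of `text`.

    Returns token ids, character offsets into `text`, and an attention mask,
    all truncated or padded to `max_len`.
    """
    ids, offsets = [], []
    for match in re.finditer(r"\w+", text.lower()):
        word, base = match.group(), match.start()
        pieces, pos = [], 0
        while pos < len(word):
            end, found = len(word), None
            # Try the longest remaining substring first, then shrink it.
            while end > pos:
                piece = word[pos:end] if pos == 0 else "##" + word[pos:end]
                if piece in VOCAB:
                    found = (VOCAB[piece], (base + pos, base + end))
                    break
                end -= 1
            if found is None:
                # No piece matches: the whole word becomes a single [UNK].
                pieces = [(VOCAB["[UNK]"], (base, base + len(word)))]
                break
            pieces.append(found)
            pos = end
        for token_id, span in pieces:
            ids.append(token_id)
            offsets.append(span)

    ids, offsets = ids[:max_len], offsets[:max_len]
    pad = max_len - len(ids)
    return {
        "ids": ids + [VOCAB["[PAD]"]] * pad,
        "offsets": offsets + [(0, 0)] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,
    }


encoding = encode("Tokenization is faster")
print(encoding["ids"])
print(encoding["offsets"])
print(encoding["attention_mask"])
```

The offsets map each token back to a character span in the input, and the attention mask marks which positions hold real tokens rather than padding.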

mark_l_watson, over 5 years ago

I love the work done and made freely available by both spaCy and HuggingFace.

I had my own NLP libraries for about 20 years: the simple ones were examples in my books, and the more complex, less understandable ones I sold as products and used to pull in lots of consulting work.

I have completely given up developing my own NLP tools, and generally I use the Python bindings (via the Hy language (hylang), which is a Lisp that sits on top of Python) for spaCy, huggingface, TensorFlow, and Keras. I am retired now, but my personal research is in hybrid symbolic and deep learning AI.

screye, over 5 years ago

I can't believe the level of productivity this Hugging Face team has.

They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.

Idk what their monetization plan is as a startup, but it is 100% undervalued at 20 million, and that is just the quality of that team. Now, if only I could figure out how to put a few thousand $ in a series-A startup as just some guy.

ZeroCool2u, over 5 years ago

We use both SpaCy and HuggingFace at work. Is there a comparison of this vs SpaCy's tokenizer [1]?

[1] https://spacy.io/usage/linguistic-features#tokenization

LunaSea, over 5 years ago

It used to be that pre-deep-learning tokenizers would extract ngrams (n-token sized chunks), but this doesn't seem to exist anymore in the word embedding tokenizers I've come across.

Is this possible using HuggingFace (or another word embedding based library)?

I know that there are some simple heuristics, like merging noun token sequences together to extract ngrams, but they are too simplistic and very error prone.
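As an illustration of the "n-token sized chunks" mentioned above, ngram extraction can be layered on top of any tokenizer's output. This is a small sketch under assumptions: the `ngrams` helper is hypothetical and not part of any library, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens, in order (n >= 1)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace splitting is a stand-in for a real tokenizer's output.
tokens = "new york is a big city".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

This only enumerates windows; deciding which ngrams are meaningful phrases (the hard part raised in the comment) still needs frequency statistics or a linguistic filter.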

useful, over 5 years ago

Somewhat related, if someone wants to build something awesome: I haven't seen anything that merges Lucene with BPE/SentencePiece.

SentencePiece should make it possible to shrink the memory requirements of your indexes for search and typeahead stuff.

hnaccy, over 5 years ago

Great! Just did a quick test and got a 6-7x speedup on tokenization.

orestis, over 5 years ago

Are there examples on how this can be used for topic modeling, document similarity etc? All the examples I’ve seen (gensim) use bag-of-words which seems to be outdated.
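For context, the bag-of-words approach referred to above amounts to comparing word-count vectors. The sketch below uses only the standard library, not gensim; the `bow` and `cosine` helpers are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
import math
from collections import Counter


def bow(text):
    """Bag of words: map each lowercased whitespace token to its count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


doc_a = bow("fast tokenizer")
doc_b = bow("fast library")
print(cosine(doc_a, doc_b))
```

Because word order and context are discarded, two documents only score as similar when they share exact tokens, which is the limitation the comment alludes to.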

echelon, over 5 years ago

I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.

What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?

What cool problems are there?

m0zg, over 5 years ago

Question for HuggingFace folks. Your repos do not contain any tests. Why is that? How do you ensure your stuff actually works after you make a change?

virtuous_signal, over 5 years ago

I didn't realize that particular emoji had a name. I thought it was a play on this: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franchise)#Facehugger

manojlds, over 5 years ago

Title is off? Should mention Tokenizers as the project.

rsp1984, over 5 years ago

What does tokenization (of strings, I guess) do?

tarr11, over 5 years ago

Why is this company called HuggingFace?

Show HN: HuggingFace – Fast tokenization library for deep-learning NLP pipelines
