
Show HN: HuggingFace – Fast tokenization library for deep-learning NLP pipelines

168 points by julien_c over 5 years ago

14 comments

julien_c over 5 years ago
TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast & versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).

Main features:
- Encode 1GB in 20sec
- Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
- Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and Node.js

GitHub repository and docs: https://github.com/huggingface/tokenizers/tree/master/tokenizers

To install:
- Rust: https://crates.io/crates/tokenizers
- Python: pip install tokenizers
- Node: npm install tokenizers
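To illustrate what a BPE tokenizer computes, here is a toy pure-Python sketch of a single merge step — the core operation the library implements (much faster) in Rust. The corpus, symbols, and function names are made up for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a frequency-weighted corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word is split into characters, mapped to its frequency.
corpus = {("l", "o", "w"): 5,
          ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6}
pair = most_frequent_pair(corpus)   # the most frequent adjacent pair
corpus = merge_pair(corpus, pair)   # fuse that pair into one symbol
```

A real trainer repeats this loop until the vocabulary reaches a target size, recording the merge order so the same merges can be replayed at encode time.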
mark_l_watson over 5 years ago
I love the work done and made freely available by both spaCy and Hugging Face.

I had my own NLP libraries for about 20 years; the simple ones were examples in my books, and the more complex (and not so understandable) ones I sold as products and used to pull in lots of consulting work.

I have completely given up developing my own NLP tools, and generally I use the Python bindings (via the Hy language (hylang), a Lisp that sits on top of Python) for spaCy, Hugging Face, TensorFlow, and Keras. I am retired now, but my personal research is in hybrid symbolic and deep learning AI.
screye over 5 years ago
I can't believe the level of productivity this Hugging Face team has.

They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.

I don't know what their monetization plan is as a startup, but it is 100% undervalued at 20 million, and that is just on the quality of the team. Now, if only I could figure out how to put a few thousand dollars into a Series A startup as just some guy.
ZeroCool2u over 5 years ago
We use both spaCy and Hugging Face at work. Is there a comparison of this vs spaCy's tokenizer[1]?

1. https://spacy.io/usage/linguistic-features#tokenization
LunaSea over 5 years ago
It used to be that pre-deep-learning tokenizers would extract n-grams (n-token-sized chunks), but this doesn't seem to exist anymore in the word-embedding tokenizers I've come across.

Is this possible using Hugging Face (or another word-embedding-based library)?

I know there are some simple heuristics, like merging noun token sequences together to extract n-grams, but they are too simplistic and very error-prone.
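For reference, classical n-gram extraction over an already-tokenized sequence is only a few lines; what's being asked about is having this integrated into the embedding pipeline rather than bolted on afterwards. A minimal sketch (the token list is made up):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token chunks of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["fast", "tokenization", "library", "for", "nlp"]
bigrams = ngrams(tokens, 2)   # 4 overlapping 2-token chunks
```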
useful over 5 years ago
Somewhat related: if someone wants to build something awesome, I haven't seen anything that merges Lucene with BPE/SentencePiece.

SentencePiece should make it possible to shrink the memory requirements of your indexes for search and typeahead stuff.
hnaccy over 5 years ago
Great! Just did a quick test and got a 6-7x speedup on tokenization.
orestis over 5 years ago
Are there examples of how this can be used for topic modeling, document similarity, etc.? All the examples I've seen (gensim) use bag-of-words, which seems to be outdated.
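For context, the bag-of-words baseline being called outdated here is just cosine similarity over term-count vectors; embedding-based approaches keep the same cosine step but swap the count vector for a pooled model output. A minimal pure-Python sketch of the baseline (documents are made up):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = Counter("fast tokenization for nlp models".split())
d2 = Counter("tokenization library for nlp".split())
sim = cosine(d1, d2)   # three shared terms out of five and four
```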
echelon over 5 years ago
I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.

What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?

What cool problems are there?
m0zg over 5 years ago
Question for HuggingFace folks. Your repos do not contain any tests. Why is that? How do you ensure your stuff actually works after you make a change?
virtuous_signal over 5 years ago
I didn't realize that particular emoji had a name. I thought it was a play on this: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franchise)#Facehugger
manojlds over 5 years ago
Title is off? It should mention Tokenizers as the project.
rsp1984 over 5 years ago
What does tokenization (of strings, I guess) do?
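Roughly: it splits a string into units from a fixed vocabulary and maps them to the integer ids a model consumes. A toy WordPiece-style sketch with a hypothetical three-entry vocabulary (real tokenizers learn vocabularies of tens of thousands of entries from data):

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
vocab = {"token": 0, "##ization": 1, "[UNK]": 2}

def tokenize(word, vocab):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # No vocabulary entry matched any prefix: emit the unknown token.
            return ["[UNK]"]
        start = end
    return pieces

pieces = tokenize("tokenization", vocab)   # split into known subwords
ids = [vocab[p] for p in pieces]           # map subwords to integer ids
```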
tarr11 over 5 years ago
Why is this company called HuggingFace?