Show HN: HuggingFace – Fast tokenization library for deep-learning NLP pipelines

168 points by julien_c, over 5 years ago

14 comments

julien_c, over 5 years ago
TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast & versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).

Main features:
- Encode 1GB in 20sec
- Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
- Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and node.js

Github repository and doc: https://github.com/huggingface/tokenizers/tree/master/tokenizers

To install:
- Rust: https://crates.io/crates/tokenizers
- Python: pip install tokenizers
- Node: npm install tokenizers
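
To make the "exhaustive set of outputs" mentioned above more concrete, here is a minimal sketch in plain Python of what offset mappings, attention masks, and special token masks can look like for one padded input. It does not use the tokenizers library or its API: the `VOCAB` table, the `encode` function, the `max_len` parameter, and the whitespace/punctuation splitting rule are all hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer would learn or load one.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "hello": 4, "world": 5, "!": 6}

def encode(text, max_len=8):
    """Return ids, offsets, attention mask and special-token mask for `text`."""
    tokens = ["[CLS]"]
    offsets = [(0, 0)]   # special tokens map to an empty character span
    special = [1]
    # Split into words and single punctuation marks, keeping character spans.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(m.group().lower())
        offsets.append((m.start(), m.end()))
        special.append(0)
    tokens.append("[SEP]")
    offsets.append((0, 0))
    special.append(1)

    # Unknown tokens fall back to the [UNK] id.
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
    attention = [1] * len(ids)

    # Pad up to max_len so several encodings could be stacked into one tensor.
    # Inputs longer than max_len are left as-is (no truncation in this sketch).
    pad = max(0, max_len - len(ids))
    ids += [VOCAB["[PAD]"]] * pad
    offsets += [(0, 0)] * pad
    special += [1] * pad
    attention += [0] * pad
    return {"ids": ids, "offsets": offsets,
            "attention_mask": attention, "special_tokens_mask": special}

enc = encode("Hello world!")
print(enc["ids"])                  # [2, 4, 5, 6, 3, 0, 0, 0]
print(enc["offsets"][1:4])         # [(0, 5), (6, 11), (11, 12)]
print(enc["attention_mask"])       # [1, 1, 1, 1, 1, 0, 0, 0]
print(enc["special_tokens_mask"])  # [1, 0, 0, 0, 1, 1, 1, 1]
```

The offsets let you map each token back to the exact characters it came from, and the attention mask tells a model which positions are real input rather than padding.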

mark_l_watson, over 5 years ago
I love the work done and made freely available by both spaCy and HuggingFace.

I had my own NLP libraries for about 20 years: the simple ones were examples in my books, and the more complex, less understandable ones I sold as products and used to pull in lots of consulting work.

I have completely given up my own work developing NLP tools, and generally I use the Python bindings (via the Hy language (hylang), which is a Lisp that sits on top of Python) for spaCy, huggingface, TensorFlow, and Keras. I am retired now but my personal research is in hybrid symbolic and deep learning AI.

screye, over 5 years ago
I can't believe the level of productivity this Hugging Face team has.

They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.

Idk what their monetization plan is as a startup, but it is 100% undervalued at 20 million, and that is just the quality of that team. Now, if only I could figure out how to put a few thousand $ in a series-A startup as just some guy.

ZeroCool2u, over 5 years ago
We use both SpaCy and HuggingFace at work. Is there a comparison of this vs SpaCy's tokenizer[1]?

1. https://spacy.io/usage/linguistic-features#tokenization

LunaSea, over 5 years ago
It used to be that pre-DeepLearning tokenizers would extract ngrams (n-token sized chunks) but this doesn't seem to exist anymore in the word embedding tokenizers I've come by.

Is this possible using HuggingFace (or another word embedding based library)?

I know that there are some simple heuristics like merging noun token sequences together to extract ngrams but they are too simplistic and very error prone.
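
For readers unfamiliar with the "n-token sized chunks" described above, the following is a minimal sketch of n-gram extraction as a sliding window over a token list. It is a generic post-processing step that can be applied to the output of any tokenizer; it is not presented as a feature of HuggingFace or any other library, and the `ngrams` helper and the whitespace split are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting stands in for a real tokenizer here.
tokens = "new york is bigger than new jersey".split()

bigrams = ngrams(tokens, 2)
print(bigrams[0])   # ('new', 'york')
print(len(bigrams)) # 6

# Counting n-grams over a corpus is the usual next step.
counts = Counter(ngrams(tokens, 1))
print(counts[("new",)])  # 2
```

A plain sliding window yields every adjacent run, so deciding which n-grams are meaningful phrases still needs a separate scoring or filtering step.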

useful, over 5 years ago
Somewhat related, if someone wants to build something awesome, I haven't seen anything that merges lucene with BPE/SentencePiece.

SentencePiece has to make it so you can shrink the memory requirements of your indexes for search and typeahead stuff.

hnaccy, over 5 years ago
Great! Just did a quick test and got a 6-7x speedup on tokenization.

orestis, over 5 years ago
Are there examples on how this can be used for topic modeling, document similarity etc? All the examples I’ve seen (gensim) use bag-of-words which seems to be outdated.

echelon, over 5 years ago
I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.

What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?

What cool problems are there?

m0zg, over 5 years ago
Question for HuggingFace folks. Your repos do not contain any tests. Why is that? How do you ensure your stuff actually works after you make a change?

virtuous_signal, over 5 years ago
I didn't realize that particular emoji had a name. I thought it was a play on this: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franchise)#Facehugger

manojlds, over 5 years ago
Title is off? Should mention Tokenizers as the project.

rsp1984, over 5 years ago
What does tokenization (of strings, I guess) do?
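
As one possible illustration of the question above: tokenization splits a string into units drawn from a fixed vocabulary, which can then be looked up as integer ids for a model. The sketch below is a simplified byte-pair-encoding (BPE) style splitter in plain Python. It is not the tokenizers library's implementation: the helper names `learn_bpe`, `merge_pair`, and `tokenize` and the toy word list are hypothetical, and end-of-word markers and byte-level handling are omitted.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def tokenize(word, merges):
    """Split `word` into subwords by applying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy training data (hypothetical).
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, 4)
print(tokenize("lowest", merges))  # ['low', 'est']
```

Here "lowest" never appears in the training words, yet it is split into two pieces that were learned from them, which is why subword schemes like this can cover unseen words with a limited vocabulary.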

tarr11, over 5 years ago
Why is this company called HuggingFace?