Classifying 200k articles in 7 hours using NLP

152 points by rezamoaiandin, almost 5 years ago

11 comments

smeeth, almost 5 years ago

This article is a bit of a Frankenstein monster. There are too many possible target audiences for ML blogs and it seems like this post made an attempt to please everybody. This isn't a condemnation of the author, it's just an impossible task.

1. Experienced ML practitioners will be unimpressed with the ML task generally (simple problem, no comparison with common models, no use of a common dataset, no lit review) and wish there were more detail in model design.

2. Inexperienced ML practitioners will be happy with the bird's-eye view of NLP tasks but wish there were more implementation details.

3. Potential clients (non-technical) will get lost in the details/lingo and wish there were case studies or a vision of what this service can accomplish for them/their business.

4. Potential clients (technical) and SWEs will wish they got a better look at the GUI, got an explanation of the stack, and will wonder about APIs/integration with whatever it is they already do.

Perhaps this might explain why literally every other comment at the time I'm writing this is asking for additional details. Pick one or two!
jointpdf, almost 5 years ago

For those interested in related/alternative approaches, one or more of the following established open-source libraries might appeal to you:

- Snorkel (training data curation, weak supervision, heuristic labeling functions, uncertainty sampling, relation extraction): https://github.com/snorkel-team/snorkel

- AllenNLP (many pretrained NLP research models for tasks beyond text classification, model training and serving, visualization/interpretability utilities): https://github.com/allenai/allennlp

- spaCy (tokenization, NER/POS tagging + visualizer, pretrained word vectors, integration with DL models): https://github.com/explosion/spaCy

- Hugging Face Transformers (latest and greatest pretrained models, e.g. BERT): https://github.com/huggingface/transformers

- ...or a barebones "from scratch" solution in less than an hour with a Colab notebook and scikit-learn (preprocess text into tf-idf vectors, LSA/NMF to generate "document embeddings", visualize the embeddings with t-SNE/UMAP [facilitates weak supervision/active learning], classify with LogReg/RF/SVM/whatever); see the sketch below. You could also tack on pretrained gensim/TF/PyTorch models quite easily as a next step. But this basic flow quickly gives you a handle on your corpus.

By the way, the docs for DeepDive (the predecessor of Snorkel) are some amazingly detailed background reading: http://deepdive.stanford.edu/example-spouse
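To make that last bullet concrete, here is a minimal sketch of the scikit-learn flow, with the 20 Newsgroups sample standing in for a real corpus. The component counts, the choice of NMF over LSA, and the classifier are arbitrary, and the t-SNE/UMAP visualization step is omitted:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus; swap in your own documents and labels.
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

pipeline = make_pipeline(
    TfidfVectorizer(max_features=50_000, stop_words="english"),  # sparse tf-idf vectors
    NMF(n_components=100, random_state=0),                       # dense "document embeddings"
    LogisticRegression(max_iter=1000),                           # simple linear classifier
)
pipeline.fit(X_train, y_train)
print("held-out accuracy:", pipeline.score(X_test, y_test))
```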
sixhobbits, almost 5 years ago

The title makes it sound like they talk about how they did it so efficiently.

But all the info we get about that is:

"The last step was to combine the four binary models into one multiclass model, as explained in the previous section, and use it to classify 1M new documents automatically. To do this, we simply went on the UI and uploaded a new list of documents."

Great intro-to-NLP article, but very light on the actual implementation details and dataset.
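For context, the standard way to get one multiclass model out of four binary ones is one-vs-rest: score each document with every binary model and take the most confident class. The article doesn't say what models it used, so this sketch uses hypothetical tf-idf + logistic regression stand-ins:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

classes = ["politics", "sports", "tech", "finance"]  # hypothetical categories
docs = ["the senate passed the bill", "the striker scored twice",
        "the chip uses a 5nm process", "the stock fell after earnings"]
labels = np.arange(4)

vec = TfidfVectorizer().fit(docs)
X = vec.transform(docs)

# One binary model per class: positive = "this class", negative = everything else.
binary_models = [LogisticRegression().fit(X, (labels == i).astype(int))
                 for i in range(len(classes))]

def predict_multiclass(texts):
    Xt = vec.transform(texts)
    # Stack each model's P(positive) and pick the most confident class.
    scores = np.column_stack([m.predict_proba(Xt)[:, 1] for m in binary_models])
    return [classes[i] for i in scores.argmax(axis=1)]

print(predict_multiclass(["the goalkeeper made a great save"]))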
swayson, almost 5 years ago

This reminded me of a great OSS tool I discovered the other day for data labelling. It is called Label Studio (https://labelstud.io/playground/) and supports quite a variety of different task formats. Works well.

Disclaimer: no affiliation, only sharing for those who are curious.
yunusabd, almost 5 years ago

I agree with the other posters that the intro-to-NLP part is unnecessary. It reads like those recipe websites where they tell you their whole life story before the actual recipe. I get that it's good for SEO, but it's still annoying to read.

Did you try other solutions like ULMFiT [1]? Seems like the exact use case for that, although it might be overkill for just 4 categories.

[1] https://arxiv.org/abs/1801.06146
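For anyone curious, the ULMFiT recipe is: fine-tune a pretrained language model on your corpus, then reuse its encoder in a classifier. A rough sketch with fastai's high-level text API, on toy stand-in DataFrames (this is not the article's pipeline, just what [1] looks like in practice):

```python
import pandas as pd
from fastai.text.all import *

# Toy stand-ins; in practice, the full unlabeled corpus and its labeled subset.
df = pd.DataFrame({"text": ["great phone, fast shipping", "awful support experience"] * 50})
df_labeled = pd.DataFrame({"text": ["great phone", "awful service"] * 50,
                           "label": ["pos", "neg"] * 50})

# Stage 1: fine-tune the pretrained AWD-LSTM language model on the raw text.
dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True, valid_pct=0.1)
lm_learn = language_model_learner(dls_lm, AWD_LSTM)
lm_learn.fine_tune(1)
lm_learn.save_encoder("ft_encoder")

# Stage 2: train a classifier on the labeled subset, reusing the encoder.
dls_clf = TextDataLoaders.from_df(
    df_labeled, text_col="text", label_col="label", text_vocab=dls_lm.vocab
)
clf_learn = text_classifier_learner(dls_clf, AWD_LSTM, metrics=accuracy)
clf_learn.load_encoder("ft_encoder")
clf_learn.fine_tune(4)
```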
pmayrgundter, almost 5 years ago

Glad to see work like this being shared!

There are some well-known text classification datasets, e.g. the Reuters news dataset from David Lewis of Bell Labs:

    http://www.daviddlewis.com/resources/testcollections/reuters21578/

More background here:

    https://link.springer.com/content/pdf/bbm%3A978-3-642-04533-2%2F1.pdf

Here's a result from ReelTwo's Classification System circa 2003 (based on a Bayesian learner; related to the U Waikato WEKA ML system), if you'd be up for a comparison:

    https://web.archive.org/web/20040606002449/http://www.reeltwo.com/datasets.html

- 10 categories
- 2,535 documents
- 15 s build time (~170 docs/sec; these were short news abstracts; see the PDF above for an example)
- 0.9121 F-measure

Build time is the time to load, model, and evaluate (using leave-one-out evaluation) a dataset on a WinXP / 1 GHz Celeron / 256 MB computer. F-measure is the micro-averaged F-measure across all categories in the dataset.
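(A refresher on that metric: micro-averaging pools true positives, false positives, and false negatives across all categories before computing precision and recall, so for single-label problems it equals plain accuracy. A quick illustration with scikit-learn and made-up labels:)

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 2, 2, 3, 3]  # made-up gold labels over 4 categories
y_pred = [0, 1, 1, 1, 2, 0, 3, 3]  # made-up predictions, 6 of 8 correct

# Micro-averaging pools TP/FP/FN over all classes, then computes P, R, F.
p = precision_score(y_true, y_pred, average="micro")
r = recall_score(y_true, y_pred, average="micro")
f = f1_score(y_true, y_pred, average="micro")
print(p, r, f)  # all equal here: 0.75
```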
ashish01, almost 5 years ago

I guess it's cool. It would be interesting to compare this with fastText as a baseline for classification.
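(For reference, the fastText supervised baseline being suggested is only a few lines. The file contents and labels below are just illustrative:)

```python
import fasttext

# Write a tiny training file in fastText's "__label__<class> <text>" format.
with open("train.txt", "w") as f:
    f.write("__label__tech the chip uses a 5nm process\n")
    f.write("__label__finance the stock fell after earnings\n")
    f.write("__label__sports the striker scored twice\n")
    f.write("__label__politics the senate passed the bill\n")

model = fasttext.train_supervised(input="train.txt", epoch=25, wordNgrams=2)
print(model.predict("shares dropped after the earnings call"))
```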
kgarten, almost 5 years ago

Maybe off topic... Is "Stanford ML expert" some type of accreditation? How do you become a Stanford ML expert? :) By attending the (excellent) Stanford online course on Machine Learning, or do I have to read an ML book on the Stanford campus?
blackbear_, almost 5 years ago

I wonder what classifier was used (presumably neural-network-based, given the figures), and how it compares to a simple bag-of-words baseline such as a linear model or naive Bayes. The examples look easy enough to be classified by matching keywords.
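(That kind of baseline is a few lines in scikit-learn; a sketch with a made-up corpus:)

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; the point is that keyword-ish classes fall out of
# raw word counts alone, with no embeddings or neural network involved.
docs = ["election results announced tonight",
        "team wins the championship final",
        "new laptop ships with faster memory",
        "central bank raises interest rates"]
labels = ["politics", "sports", "tech", "finance"]

baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(docs, labels)
print(baseline.predict(["bank interest rates rise again"]))
```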
josephjrobison, almost 5 years ago
Is Sculpt AI available for public use? I see sculptintel.com is down.
_Microft, almost 5 years ago

"This article has been written by Sculpt AI [...] in collaboration with Reza [article's author]", so we now use tools to write longer articles faster, only to later feed them into tl;dr bots/summarizers to get the gist without having to read all of it. ;)