Making Text Mining Accessible to Any Developer & Non-Expert

80 points by wfaler over 13 years ago

10 comments

law over 13 years ago
I have a good amount of experience in natural language processing and machine learning, and I don't think offering an API that provides easy access to the algorithms is the right solution. The major algorithms in text classification aren't that complex to implement, and can be done in a few hundred lines. Moreover, all of the most widely used, widely tested, and reliable algorithms have public implementations that are readily adaptable to your needs. And that's the problem: *understanding your needs*.

Understanding your needs (or your company's needs) is where people with PhDs make their money. Machine learning isn't a panacea, and we won't be seeing a one-size-fits-all approach for a while. Even though data has become more accessible, it might be noisy, incomplete, streaming, partially labeled, etc. This is why understanding exactly what you're trying to model with these algorithms is crucial and why "just applying" them is impractical at best and misleading at worst.
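To give a sense of scale for the "few hundred lines" remark, here is a minimal sketch of one common text classifier, multinomial naive Bayes with add-one smoothing. The choice of algorithm, the `NaiveBayes` class, the `tokenize` helper, and the toy data are all illustrative assumptions; the comment above does not name a specific algorithm.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training docs
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set, far too small for real use.
TOY_DATA = [
    ("great product, works well", "pos"),
    ("love it, great value", "pos"),
    ("terrible support, broken on arrival", "neg"),
    ("broken and useless", "neg"),
]

model = NaiveBayes()
model.train(TOY_DATA)
```

Calling `model.predict("great value")` picks the label whose training words best match the input. The code is short; deciding what the labels should be and whether the training data reflects the real problem is the part the comment describes as hard.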
danso over 13 years ago
I've found in my data-mining experience that the most interesting data (at least on the Web) is not particularly easy to parse, even if you write something that automates a form's POST submissions. The second difficult part is normalizing it, as much web/text data is formatted for *display* to humans, which is quite different from data in an easily analyzable form.

So given that, it's worth learning just enough programming to do loops, conditionals, and regexes to get what you want.
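As a rough illustration of the loops, conditionals, and regexes approach, the sketch below normalizes rows that were formatted for display into records that can be analyzed. The row layout, `ROW_PATTERN`, and the `normalize` helper are hypothetical; real scraped data will need its own pattern.

```python
import re

# Hypothetical scraped rows, laid out for human readers rather than analysis.
RAW_ROWS = [
    "  Widget A   |  $1,299.00 | 12 reviews ",
    "Widget B | $45 | 1 review",
    "Out of stock",
]

ROW_PATTERN = re.compile(
    r"^\s*(?P<name>.+?)\s*\|"                # item name, surrounding spaces trimmed
    r"\s*\$(?P<price>[\d,]+(?:\.\d+)?)\s*\|"  # price with optional commas and cents
    r"\s*(?P<reviews>\d+)\s+reviews?\s*$"     # review count, singular or plural
)


def normalize(rows):
    """Turn display-formatted rows into dicts, skipping rows that do not match."""
    records = []
    for row in rows:
        match = ROW_PATTERN.match(row)
        if match is None:
            continue  # not a data row, so ignore it
        records.append({
            "name": match.group("name"),
            "price": float(match.group("price").replace(",", "")),
            "reviews": int(match.group("reviews")),
        })
    return records


records = normalize(RAW_ROWS)
```

Here `records` holds two dicts with numeric `price` and `reviews` fields, and the "Out of stock" row is dropped because it does not match the pattern.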
mark_l_watson over 13 years ago
Text mining is one of my specialties and I have had similar ideas for a business. One thing that has stopped me is the awesome (and free for about 50K API calls a day) Open Calais service that does entity extraction and identifies some relationships between entities in input text.

For document clustering there are many good open source tools that people and companies can use. The commercial LingPipe product does a good job at sentiment analysis.

Obtaining, scrubbing, and generally curating the data is a pain point that users of this system may still need to worry about.

I wish this new business good luck, but there are definitely some real problems to work around. Perhaps we should go into business together :-)
zeratul over 13 years ago
Text mining: most of the time is spent on gathering the data, curating the data, and working with your annotators (domain experts). After that, you try *a dozen or more* ways to convert documents into a matrix format. Then, you try *a dozen or more* feature selection algorithms. Finally, the icing on the cake: you get to try *a dozen or more* machine learning algorithms, each having *a dozen or more parameters* to be estimated.

Yep, it would be very nice to have an API that would do all that for you. But that would require a group of at least 10 ML experts + 10 NLP experts + 20 domain experts. Still, I think it's doable and one should make small efforts to make it happen.

Marginal thoughts: *decision trees* are very bad for large p >> n problems, though random forests might work. If TextMinr doesn't have radial SVM with auto-tuning then it will not cope with more difficult problems.
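The first two steps of that pipeline can be sketched in a few lines. This is only one of the "dozen or more" options at each step: a plain term-count matrix, followed by feature selection on document frequency. The function names, the `min_df` parameter, and the sample documents are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus.
DOCS = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for term in vocab] for doc_counts in counts]
    return vocab, matrix


def select_by_doc_frequency(vocab, matrix, min_df):
    """Keep only the terms that occur in at least min_df documents."""
    keep = [
        j for j in range(len(vocab))
        if sum(1 for row in matrix if row[j] > 0) >= min_df
    ]
    kept_vocab = [vocab[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_vocab, kept_matrix


vocab, matrix = build_matrix(DOCS)
kept_vocab, kept_matrix = select_by_doc_frequency(vocab, matrix, min_df=2)
```

With this corpus the ten-term vocabulary shrinks to the three terms that occur in at least two documents. Every choice made here (tokenizer, weighting, threshold) is one of the knobs the comment says you end up trying many variants of.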
vyrotek over 13 years ago
Great! I was just investigating AlchemyAPI and OpenCalais this weekend. I look forward to trying TextMinr.

TextMinr seems like a combination of those services along with the idea of 80legs.com? Is that correct?
tomwalsham over 13 years ago
I'm really happy to see more people moving into this space.

I've used a number of different systems (openCalais, AlchemyAPI, Zemanta...) in a variety of projects (sentiment analysis, document classification...), and what I've found thus far is that while each system works extremely well within some restricted application classes, none come close to being general-purpose APIs for the myriad applications developers try to throw at them.

A couple of pain points I've encountered: needing a larger-than-expected corpus to generate meaningful data, because the scope of the platform's analysis is overly broad, and being unable to apply negative signals from external sources. I find there tends to remain a large quantity of logic sitting rather redundantly on the application end to post-filter what's generated.

I don't pretend to understand the level of complexity involved or what's being worked on currently (not an NLP guy), but I do think there's a huge space to create publicly available text mining which can more effectively be applied to narrow domains.
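The application-side post-filtering mentioned above might look something like the sketch below. It assumes a tagging service that returns `(tag, score)` pairs; that response shape, the `post_filter` function, and the `min_score` threshold are hypothetical and do not describe any of the named services.

```python
def post_filter(tags, negative_terms, min_score=0.5):
    """Drop tags that an external negative signal rules out or that score too low."""
    blocked = {term.lower() for term in negative_terms}
    return [
        (tag, score)
        for tag, score in tags
        if score >= min_score and tag.lower() not in blocked
    ]


# Hypothetical output from a tagging service.
API_TAGS = [("Apple", 0.9), ("fruit", 0.8), ("pie", 0.3)]

# The application knows from elsewhere that this text is not about fruit.
filtered = post_filter(API_TAGS, negative_terms=["Fruit"])
```

Only the "Apple" tag survives here. Logic like this is what ends up duplicated in each application when the service itself cannot accept negative signals.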
itmag over 13 years ago
Machine learning seems to be a meme on the rise right now. What kind of startups are possible in that domain?
_gd3l over 13 years ago
This would really be useful on many, many levels. Why has access to such data been so roped off?
marshallp over 13 years ago
This probably won't work, as Google found out with its Prediction API (it hasn't been used much). There's already enough open source software out there that's state of the art and easy to use.

There's good business to be had in selling data, though, which is where these folks should probably divert their effort.
suivix over 13 years ago
I honestly don't think this is feasible without regular expressions. There are a great many minor details that make data mining work, and they have to be custom-tailored to each solution.