I have a good amount of experience in natural language processing and machine learning, and I don't think offering an API that provides easy access to the algorithms is the right solution. The major algorithms in text classification aren't that complex to implement, and can be done in a few hundred lines. Moreover, all of the most widely used, widely tested, and reliable algorithms have public implementations that are readily adaptable to your needs. And that's the problem: <i>understanding your needs</i>.<p>Understanding your needs (or your company's needs) is where people with PhDs make their money. Machine learning isn't a panacea, and we won't be seeing a one-size-fits-all approach for a while. Even though data has become more accessible, it might be noisy, incomplete, streaming, partially labeled, etc. This is why understanding exactly what you're trying to model with these algorithms is crucial, and why "just applying" them is impractical at best and misleading at worst.
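To make the "few hundred lines" claim concrete, here is a minimal sketch of one of those major algorithms — a multinomial Naive Bayes text classifier with add-one smoothing, in plain Python. This is an illustration, not a production implementation; real use would need tokenization, feature pruning, and evaluation against your actual data, which is exactly the "understanding your needs" part.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes text classifier (illustrative sketch)."""

    def __init__(self):
        self.class_counts = Counter()              # documents per class
        self.word_counts = defaultdict(Counter)    # word counts per class
        self.vocab = set()

    def train(self, labeled_docs):
        # labeled_docs: iterable of (text, label) pairs
        for text, label in labeled_docs:
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.class_counts:
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(self.class_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The core is about thirty lines; the remaining few hundred in a serious implementation go into exactly the problem-specific parts the API can't do for you.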
I've found in my data-mining experience that the most interesting data (at least on the Web) is not particularly easy to parse, even if you write something that automates a form's POST submissions. The second difficult part is normalizing it, as much web/text data is formatted for <i>display</i> to humans, which is quite different from data in an easily analyzable form.<p>So given that, it's worth learning enough programming to do loops, conditionals, and regexes to get what you want.
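The normalization step above is often just a handful of regexes. Here is a hedged sketch of the kind of cleanup that turns display-oriented HTML into analyzable text — for anything beyond trivial markup you'd want a real HTML parser, but this shows how far loops, conditionals, and regexes get you:

```python
import re
from html import unescape

def normalize(raw_html):
    """Strip display markup from scraped HTML so the text can be analyzed.
    Illustrative only: real pages usually warrant a proper parser."""
    # drop script blocks entirely (their content is not display text)
    text = re.sub(r"<script.*?</script>", " ", raw_html, flags=re.S | re.I)
    # drop all remaining tags
    text = re.sub(r"<[^>]+>", " ", text)
    # decode entities like &amp; and &nbsp;
    text = unescape(text)
    # collapse runs of whitespace (including non-breaking spaces)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize('<p>Price:&nbsp;<b>$4.99</b></p>')` yields `'Price: $4.99'` — human formatting gone, data kept.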
Text mining is one of my specialties and I have had similar ideas for a business. One thing that has stopped me is the awesome (and free for about 50K API calls a day) OpenCalais service that does entity extraction and identifies some relationships between entities in input text.<p>For document clustering there are many good open source tools that people and companies can use. The commercial LingPipe product does a good job at sentiment analysis.<p>Obtaining, scrubbing, and generally curating the data is a pain point that users of this system may still need to worry about.<p>I wish this new business good luck, but there are definitely some real problems to work around. Perhaps we should go into business together :-)
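On the document-clustering point: even without the open source tools, the basic idea fits in a few dozen lines. Here is a deliberately simple sketch — greedy one-pass clustering over bag-of-words vectors with cosine similarity and a tiny hand-picked stopword list (both the threshold and the stopwords are illustrative choices, not recommendations):

```python
import math
from collections import Counter

# tiny illustrative stopword list; real tools ship much larger ones
STOP = {"the", "a", "an", "on", "of", "in", "and", "to"}

def vectorize(text):
    """Bag-of-words term counts, minus stopwords."""
    return Counter(w for w in text.lower().split() if w not in STOP)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.5):
    """Greedy one-pass clustering: each document joins the first cluster
    whose seed document is similar enough, else starts a new cluster."""
    clusters = []  # each entry: (seed vector, list of doc indices)
    for i, doc in enumerate(docs):
        v = vectorize(doc)
        for seed, members in clusters:
            if cosine(v, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]
```

The open source packages earn their keep in the parts this skips: TF-IDF weighting, proper stemming, and clustering algorithms that don't depend on document order.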
Text mining: most of the time is spent on gathering the data, curating the data, and working with your annotators (domain experts). After that, you try <i>a dozen or more</i> ways to convert documents into a matrix format. Then, you try <i>a dozen or more</i> feature selection algorithms. Finally, the icing on the cake: you get to try <i>a dozen or more</i> machine learning algorithms, each having <i>a dozen or more parameters</i> to be estimated.<p>Yep, it would be very nice to have an API that would do all that for you. But that would require a group of at least 10 ML experts + 10 NLP experts + 20 domain experts. Still, I think it's doable, and it's worth making incremental efforts toward it.<p>Marginal thoughts: <i>decision trees</i> are very bad for large p >> n problems - random forest might work, though. If TextMinr doesn't have a radial-kernel (RBF) SVM with auto-tuning then it will not cope with more difficult problems.
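To illustrate just the first step of that pipeline — one of the "dozen or more" ways to convert documents into a matrix — here is a hedged sketch of the simplest recipe: a bag-of-words count matrix restricted to the top terms by document frequency (everything here, including the tie-breaking rule, is an illustrative choice):

```python
from collections import Counter

def build_matrix(docs, top_k=1000):
    """One simple document-to-matrix recipe out of the dozens worth trying:
    a bag-of-words count matrix over the top_k terms by document frequency.
    Ties are broken alphabetically so the output is deterministic."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for words in tokenized:
        df.update(set(words))          # document frequency, not raw counts
    vocab = sorted(df, key=lambda w: (-df[w], w))[:top_k]
    index = {w: j for j, w in enumerate(vocab)}
    matrix = []
    for words in tokenized:
        row = [0] * len(vocab)
        for w in words:
            if w in index:
                row[index[w]] += 1
        matrix.append(row)
    return vocab, matrix
```

Swap in TF-IDF weights, character n-grams, or embeddings and you get three more of the dozen — which is precisely why no single API call covers them all.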
Great! I was just investigating AlchemyAPI and OpenCalais this weekend. I look forward to trying TextMinr.<p>TextMinr seems like a combination of those services along with the idea of 80legs.com? Is that correct?
I'm really happy to see more people moving into this space.<p>I've used a number of different systems (OpenCalais, AlchemyAPI, Zemanta...) in a variety of projects (sentiment analysis, document classification...), and what I've found thus far is that while each system works extremely well within some restricted application classes, none comes close to being a general-purpose API for the myriad applications developers try to throw at them.<p>A couple of pain points I've encountered: needing a larger-than-expected corpus to get meaningful results, because the platform's analysis is scoped too broadly; and having no way to apply negative signals from external sources. As a result, a large quantity of logic tends to sit rather redundantly on the application end to post-filter what's generated.<p>I don't pretend to understand the level of complexity involved or what's being worked on currently (not an NLP guy), but I do think there's a huge space to create publicly available text mining that can be applied more effectively to narrow domains.
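That application-side post-filtering logic usually looks something like the sketch below: drop low-confidence results and apply a domain-specific blocklist, i.e. the "negative signals" the platforms don't let you push upstream. The `(name, score)` entity format here is hypothetical — each API returns its own structure — but the shape of the filter is the same:

```python
def post_filter(entities, blocklist, min_score=0.5):
    """Application-side post-filter for entity-extraction output:
    drop low-confidence entities and anything on a domain-specific
    blocklist. The (name, score) tuple format is a stand-in for
    whatever structure the particular API actually returns."""
    return [
        (name, score)
        for name, score in entities
        if score >= min_score and name.lower() not in blocklist
    ]
```

It's trivial code, but it ends up duplicated in every application precisely because none of the APIs accept these signals as input.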
This probably won't work, as Google found out with its Prediction API (it hasn't been used much). There's already enough open source software out there that's state of the art and easy to use.<p>There's good business to be had in selling data, though, which is where these folks should probably divert their effort.
I honestly don't think this is feasible without regular expressions. There are many minor details that make data mining work, and they have to be custom-tailored to each solution.