I analyzed several sentence segmentation tools while working on my own rule-based segmenter. The results are in this README (<a href="https://github.com/diasks2/pragmatic_segmenter" rel="nofollow">https://github.com/diasks2/pragmatic_segmenter</a>).<p>I think this blog post comes close to the key point about halfway through: in my opinion, it is important to test (all of) the edge cases. The problem with most corpora used to test segmenters is that 80-90% of the sentences are of the same kind (i.e. a regular sentence ending in a period). So a segmenter that simply split the text at every period would still show an 80-90% accuracy rate. This is why I am trying to develop a standardized set of edge cases: <a href="https://github.com/diasks2/pragmatic_segmenter#the-golden-rules" rel="nofollow">https://github.com/diasks2/pragmatic_segmenter#the-golden-ru...</a>
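To make the inflated-accuracy point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the tiny corpus, the `naive_split` and `accuracy` helpers, and the 8-to-2 mix of plain sentences to edge cases. It is not how pragmatic_segmenter or any published benchmark works; it only shows how a split-at-every-period baseline can look good on a mostly regular corpus and fail completely on the edge cases.

```python
import re


def naive_split(text):
    """Hypothetical baseline: split after every period followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


def accuracy(cases, splitter):
    """Fraction of cases whose segmentation exactly matches the expected list."""
    hits = sum(1 for text, expected in cases if splitter(text) == expected)
    return hits / len(cases)


# Illustrative corpus (an assumption, not real test data):
# eight regular two-sentence texts, all ending in plain periods.
typical = [
    ("Hello world. This is a test.", ["Hello world.", "This is a test."]),
    ("The cat sat. The dog barked.", ["The cat sat.", "The dog barked."]),
    ("It rained all day. We stayed in.", ["It rained all day.", "We stayed in."]),
    ("She wrote the code. He reviewed it.", ["She wrote the code.", "He reviewed it."]),
    ("Tests passed. The build is green.", ["Tests passed.", "The build is green."]),
    ("I like tea. You like coffee.", ["I like tea.", "You like coffee."]),
    ("The talk ended. Everyone left.", ["The talk ended.", "Everyone left."]),
    ("We shipped it. Users were happy.", ["We shipped it.", "Users were happy."]),
]

# Two edge cases where a period does not end the sentence.
edge = [
    ("I met Mr. Smith today.", ["I met Mr. Smith today."]),
    (
        "She works at the U.S. embassy. He does not.",
        ["She works at the U.S. embassy.", "He does not."],
    ),
]

print(accuracy(typical + edge, naive_split))  # 0.8
print(accuracy(edge, naive_split))            # 0.0
```

Scored on the mixed corpus the baseline reaches 80%, but scored only on the edge cases it gets none right, which is the gap a dedicated edge-case test set is meant to expose.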
Self-promotion: I wrote an open-source sentence splitter tool that outperforms the state of the art on the "standard split". It is also very fast.<p><a href="http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-boundary-detection/" rel="nofollow">http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-b...</a> (link to GitHub repo in post)
I've just added this kind of support to node-summary (<a href="https://github.com/jbrooksuk/node-summary" rel="nofollow">https://github.com/jbrooksuk/node-summary</a>), and it seems to make a small positive difference in the tests.
Is anyone aware of a sentence segmenter for poorly written English that is missing some punctuation, like text from chat sessions? It could be useful for normal sentence segmentation too: if you ignore the punctuation, can you still detect the sentence boundaries?