
4 comments

diasks2 about 10 years ago

I did an analysis of different sentence segmentation tools when I was working on my own rule-based segmenter. The results can be found in this README (https://github.com/diasks2/pragmatic_segmenter).

I think this blog post almost hits on the key point in the middle: in my opinion it is important to test (all of) the edge cases. The problem with most corpora typically used to test segmenters is that 80-90% of the sentences are the same (i.e. a regular sentence ending in a period). Thus, if a segmenter simply split the text at every period, it would still show an 80-90% accuracy rate. This is why I am trying to develop a standardized set of edge cases: https://github.com/diasks2/pragmatic_segmenter#the-golden-rules
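To make the skew concrete, here is a minimal Python sketch of the effect described above. It is not taken from pragmatic_segmenter or its Golden Rules; `naive_split`, `accuracy`, and the example sentences are all hypothetical, chosen only to show how a splitter that breaks at every sentence-final punctuation mark can look good on a corpus dominated by regular sentences while failing every edge case.

```python
import re


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def accuracy(cases):
    """Fraction of (text, expected_sentences) pairs segmented exactly right."""
    correct = sum(naive_split(text) == expected for text, expected in cases)
    return correct / len(cases)


def as_cases(groups):
    """Turn lists of gold sentences into (joined_text, gold_sentences) pairs."""
    return [(" ".join(sentences), sentences) for sentences in groups]


# Hypothetical "regular" cases: every period really ends a sentence.
regular = as_cases([
    ["The cat sat.", "The dog barked."],
    ["It rained.", "We stayed in."],
    ["She left early.", "He stayed late."],
    ["Did it rain?", "Yes, it did."],
    ["Stop!", "Look both ways."],
    ["The train was late.", "Nobody minded."],
    ["I like tea.", "You like coffee."],
    ["We won.", "They lost."],
])

# Hypothetical edge cases: abbreviations whose periods do not end a sentence.
edge = as_cases([
    ["Dr. Smith arrived.", "He sat down."],
    ["The U.S. team won.", "Fans cheered."],
])

print(accuracy(regular))         # regular sentences only
print(accuracy(edge))            # edge cases only
print(accuracy(regular + edge))  # mixed corpus, dominated by regular cases
```

On this toy data the mixed score is high only because the regular cases outnumber the edge cases four to one; reporting the edge-case score separately is what exposes the weakness.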


kylebgorman about 10 years ago

Self-promotion: I wrote an open-source sentence splitter tool that outperforms the state of the art on the "standard split". It is also very fast.

http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-boundary-detection/ (link to GitHub repo in post)

jbrooksuk about 10 years ago

I've just added this kind of support to node-summary (https://github.com/jbrooksuk/node-summary), which seems to make a bit of a positive difference under the tests.

andrewtbham about 10 years ago

Is anyone aware of a sentence segmenter for poorly written English that is missing some punctuation, like text from chat sessions? It could be useful for normal sentence segmentation too, i.e. if you ignore the punctuation, can you still detect the boundaries of the sentence?

How to Split Sentences (2014)
