How to Split Sentences (2014)

88 points by f00biebletch about 10 years ago

4 comments

diasks2 about 10 years ago
I did an analysis of different sentence segmentation tools when I was working on my own rule-based segmenter. The results can be found in this README (https://github.com/diasks2/pragmatic_segmenter).

I think this blog post almost hits on the key point in the middle: in my opinion, it is important to test all of the edge cases. The problem with most corpora typically used to test segmenters is that 80-90% of the sentences are the same (i.e. a regular sentence ending in a period). Thus, if a segmenter simply split the text at every period, it would still show an 80-90% accuracy rate. This is why I am trying to develop a standardized set of edge cases: https://github.com/diasks2/pragmatic_segmenter#the-golden-rules
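To make the point about period-only splitting concrete, here is a minimal sketch. It is not taken from pragmatic_segmenter or any other tool mentioned in this thread; the function name `naive_split` and both sample strings are made up for illustration. It splits after every `.`, `!`, or `?` that is followed by whitespace, which works on regular sentences and falls apart on abbreviations.

```python
import re


def naive_split(text: str) -> list[str]:
    """Hypothetical baseline: split after ., !, or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# "Regular" sentences, the kind that dominate most test corpora.
regular = "The cat sat. The dog barked. It rained all day."

# Two sentences containing abbreviations, an edge case for period splitting.
edge = "Dr. Smith arrived at 5 p.m. on Jan. 3. He left early."

print(naive_split(regular))
# ['The cat sat.', 'The dog barked.', 'It rained all day.']

print(naive_split(edge))
# ['Dr.', 'Smith arrived at 5 p.m.', 'on Jan.', '3.', 'He left early.']
```

The first input is segmented correctly, while the second (two real sentences) comes back as five fragments. A corpus made up mostly of inputs like the first would hide that failure, which is the argument for testing against a dedicated set of edge cases.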
kylebgorman about 10 years ago
Self-promotion: I wrote an open-source sentence splitter tool that outperforms the state of the art on the "standard split". It is also very fast.

http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-boundary-detection/ (link to GitHub repo in post)
jbrooksuk about 10 years ago
I've just added this kind of support to node-summary (https://github.com/jbrooksuk/node-summary), which seems to make a bit of a positive difference under the tests.
andrewtbham about 10 years ago
Is anyone aware of a sentence segmenter for poorly written English that is missing some punctuation, like text from chat sessions? It could also be useful for normal sentence segmentation, i.e. if you ignore the punctuation, can you still detect the sentence boundaries?