
How to Split Sentences (2014)

88 points by f00biebletch, about 10 years ago

4 comments

diasks2, about 10 years ago
I did an analysis of different sentence segmentation tools when I was working on my own rule-based segmenter. The results can be found in this README (https://github.com/diasks2/pragmatic_segmenter).

I think this blog post almost hits on the key point in the middle: in my opinion it is important to test (all of) the edge cases. The problem with most corpora typically used to test segmenters is that 80-90% of the sentences are the same (i.e. a regular sentence ending in a period). Thus a segmenter that simply split the text at every period would still show an 80-90% accuracy rate. This is why I am trying to develop a standardized set of edge cases: https://github.com/diasks2/pragmatic_segmenter#the-golden-rules
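To make the accuracy point concrete, here is a minimal Python sketch. It is not taken from pragmatic_segmenter or any other tool mentioned in this thread: `naive_split`, `accuracy`, and the tiny `REGULAR` and `EDGE` test sets are all hypothetical, and accuracy is counted per input text rather than per sentence, which is an assumption of this sketch.

```python
import re


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical "regular" inputs: two plain sentences, each ending in a period.
REGULAR = [
    (f"Item {i} is here. Item {i} is done.",
     [f"Item {i} is here.", f"Item {i} is done."])
    for i in range(8)
]

# Hypothetical edge cases: abbreviations that contain or end in a period.
EDGE = [
    ("Dr. Smith arrived. He sat down.",
     ["Dr. Smith arrived.", "He sat down."]),
    ("She works at U.S. Steel in Ohio.",
     ["She works at U.S. Steel in Ohio."]),
]


def accuracy(cases, splitter):
    """Fraction of inputs whose segmentation matches the expected one exactly."""
    correct = sum(1 for text, expected in cases if splitter(text) == expected)
    return correct / len(cases)


if __name__ == "__main__":
    print(accuracy(REGULAR + EDGE, naive_split))  # 0.8 on the mixed set
    print(accuracy(EDGE, naive_split))            # 0.0 on the edge cases alone
```

On the mixed set the naive splitter looks respectable, while on the edge cases alone it gets nothing right, which is roughly the gap a dedicated edge-case suite is meant to expose.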
kylebgorman, about 10 years ago
Self-promotion: I wrote an open-source sentence splitter tool that outperforms the state of the art on the "standard split". It is also very fast.

http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-boundary-detection/ (link to GitHub repo in post)
jbrooksuk, about 10 years ago
I've just added this kind of support to node-summary (https://github.com/jbrooksuk/node-summary), which seems to make a bit of a positive difference under the tests.
andrewtbham, about 10 years ago
Is anyone aware of a sentence segmenter for poorly written English that is missing some punctuation, like text from chat sessions? It could be useful for normal sentence segmentation too, i.e. if you forget about the punctuation, can you detect the boundaries of the sentence anyway?