Ask HN: How do you split strings (to get keywords)?

2 points by gopher, almost 16 years ago
First trial, one splits on whitespace, but this breaks on punctuation and special characters.

Second trial, you use an alphanumeric whitelist and split on anything else, but what about umlauts? What about Hebrew or Cyrillic?

Third trial: split on characters < 32, whitespace, and punctuation characters; this works somewhat but is ugly. What would you do to get keywords from a string?
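To make the first two trials concrete, here is a minimal Python sketch. The sample sentence and the function names are illustrative assumptions, not part of the original post.

```python
import re

def split_on_whitespace(text):
    # Trial 1: split on runs of whitespace; punctuation stays glued to words.
    return text.split()

def split_on_ascii_whitelist(text):
    # Trial 2: keep only runs of ASCII letters and digits,
    # which cuts words apart at every non-ASCII letter.
    return re.findall(r"[A-Za-z0-9]+", text)

sample = "Hello, world! Schöne Grüße"
print(split_on_whitespace(sample))       # ['Hello,', 'world!', 'Schöne', 'Grüße']
print(split_on_ascii_whitelist(sample))  # ['Hello', 'world', 'Sch', 'ne', 'Gr', 'e']
```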

5 comments

TallGuyShort, almost 16 years ago
It depends very heavily on the origin of the string, as that determines the special cases that need to be dealt with. Can you provide more details?

Edit: Based on what you said in your original post, I would keep a list of possible delimiters (which would probably need additions for some time), tokenize the string according to that list, and discard any token that appears in a second list of words that don't matter (conjunctions, articles, prepositions, etc.). Before discarding those tokens, you'd also want to check whether they're operators used in your app, or anything like that.
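A minimal Python sketch of the delimiter-list plus stop-word approach described above. The delimiter set, the stop-word list, and the operator set are all hypothetical placeholders; a real application would need its own lists.

```python
# Hypothetical lists for illustration; both would grow over time in practice.
DELIMITERS = " \t\n.,;:!?()[]\"'"
STOP_WORDS = {"a", "an", "and", "the", "of", "or", "in", "on", "to"}
OPERATORS = {"AND", "OR", "NOT"}  # assumed app-specific search operators

# Map every delimiter character to a space so str.split() can tokenize.
_DELIMITER_TABLE = str.maketrans({ch: " " for ch in DELIMITERS})

def extract_keywords(text, operators=OPERATORS):
    tokens = text.translate(_DELIMITER_TABLE).split()
    keywords = []
    for token in tokens:
        if token in operators:
            # Check for operators before applying the stop-word filter.
            keywords.append(token)
        elif token.lower() not in STOP_WORDS:
            keywords.append(token)
    return keywords

print(extract_keywords("The cat AND the dog, in (a) box."))
# ['cat', 'AND', 'dog', 'box']
```

The operator check here is case-sensitive, so a lowercase "and" is still dropped as a stop word; whether that is the right rule depends on the app.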
dannyr, almost 16 years ago
How about term extraction?

http://developer.yahoo.com/search/content/V1/termExtraction.html
mbrubeck, almost 16 years ago
"Second trial, you use an alphanumeric whitelist and split on anything else, but what about umlauts? What about Hebrew or Cyrillic?"

A multilingual version of this could use the Unicode "General Category" character classes (Letter, Mark, Number, Punctuation, Symbol, Separator, Other).
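As a rough sketch of that idea in Python, the standard `unicodedata.category` function returns a two-letter General Category code whose first letter is the major class. The choice to keep Letter, Mark, and Number and split on everything else is an assumption made for illustration, and the function name is hypothetical.

```python
import unicodedata

def unicode_tokens(text):
    # Keep characters in the Letter (L*), Mark (M*), or Number (N*) classes;
    # treat every other character as a separator.
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "LMN":
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(unicode_tokens("Schöne Grüße, мир! 42"))
# ['Schöne', 'Grüße', 'мир', '42']
```

Note that this also splits on apostrophes and hyphens, so "don't" becomes two tokens; whether that matters depends on the keywords you want.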
alanthonyc, almost 16 years ago
Not sure what your main goal is, but in my compilers project class, we used lexical analyzers to break out tokens from the input stream.

Try looking up "Lex" or "Flex"; these were the tools we used. There may be better ones around now.

Here's a quick Google result: http://dinosaur.compilertools.net/
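Lex and Flex generate C scanners from a list of regular-expression rules. The same rule-driven idea can be sketched in Python with the standard `re` module; this is not Lex itself, and the rule names and patterns below are hypothetical.

```python
import re

# Hypothetical rule set in the spirit of a Lex specification:
# each rule pairs a token name with a regular expression.
RULES = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD",   r"[^\W\d_]+"),   # runs of Unicode letters
    ("SKIP",   r"\s+"),         # whitespace, dropped below
    ("PUNCT",  r"."),           # any other single character
]
SCANNER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in RULES))

def lex(text):
    # Yield (token_name, lexeme) pairs, skipping whitespace.
    for match in SCANNER.finditer(text):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(lex("pi = 3.14, ok")))
# [('WORD', 'pi'), ('PUNCT', '='), ('NUMBER', '3.14'), ('PUNCT', ','), ('WORD', 'ok')]
```

One difference worth knowing: Lex prefers the longest match across all rules, while Python's alternation takes the first rule that matches, so rule order matters more here. For keyword extraction you would keep only the WORD (and perhaps NUMBER) tokens.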
pedalpete, almost 16 years ago
I just found LingPipe on Monday and haven't had a chance to try it yet, but it has 'entity extraction' for text mining. Not sure if that is what you're looking for. It's a Java library: http://alias-i.com/lingpipe/

Anybody have any comments about it?