TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Ask HN: Most efficient way to test if a string contains a word?

7 pointsby bladeaodalmost 16 years ago
For an app I wrote I need to check if a String contains a word, finding the longest word it contains. Currently I check the whole string against a dictionary, then check every substring possible with decreasing length until I find a word.<p>Is there a more efficient way to do this?

4 comments

Xichekolasalmost 16 years ago
EDIT: Revised running time and added code (in Ruby)<p>Build a Trie out of your dictionary. Then plug your string into and record depth/chars where it deadends. Repeat for you string with the first char removed. At the end you should have the longest substring.<p>Minor optimization would be to stop plugging things into the Trie once the remaining characters in the string are less than the longest word you have already found.<p>This should run in linear time (relative to your search string) because the recursion is limited by the longest word in the dictionary (depth of the trie). I have tested this out just for fun.<p>First I grabbed the Wikipedia page for 'line of succession to the British throne' and stripped it down to text only (352,000 chars). Then I added 'monarch' to my dictionary. Then I took the first 10k chars of the text and put it into my trie, getting back monarch as my match and timing it. Then I figured the ratio of time taken to size of string (10k). I repeated this for 20k chars, 30k chars ... 350k chars, and the ratio remains constant, indicating time increases in proportion to string length... linear time.<p><pre><code> DICTIONARY = %w(a ants apple bear bicycle) class LameTrie def initialize(dict) @trie = {} dict.each {|word| insert(word) } # obviously it'd be nice to only build this once # and then write it out to disk, since our # dictionary probably won't change much end def insert(word, level = @trie) c, rest = h_t(word) level[c] = {} unless level.has_key?(c) if rest.empty? level[c][:term] = true else insert(rest, level[c]) end end def longest_word(str) result, n = '', str.length n.times do |i| t = longest?(str[i..-1]) result = t if t.size &#62; result.size break if result.size &#62; n-i # our micro optimization end result end def longest?(str, level = @trie) return '' if str.empty? or str.nil? c, rest = h_t(str) if level.has_key?(c) x = longest?(rest, level[c]) if level[c].has_key?(:term) || !x.empty? c + x else '' end else '' end end def h_t(str) return str[0,1], str[1..-1] end end t = LameTrie.new(DICTIONARY) puts "Dictionary: #{DICTIONARY.inspect}" puts '[emptystring] -&#62; ' + t.longest_word('') %w(a an antler crab crabapple pants polarbear adntsants oteuhbicyclentehut bearclaw).each do |str| puts str + ' -&#62; ' + t.longest_word(str) end</code></pre>
评论 #660721 未加载
ajucalmost 16 years ago
I'll use radix tree for dictionary, to speed up testing.<p>For words (son, sun, so, sunny) it would be:<p><pre><code> root -&#62; S -&#62; O -&#62; . \ \-&#62; N -&#62; . \-&#62; U -&#62; N -&#62; . \-&#62; N -&#62; Y -&#62; . </code></pre> Then - for each character in given string I'll make pointer pointing at root of dictionary tree.<p>Then in loop I would advance down each pointer in turn, until it's no longer possible to go down the tree, or the until the characters for given pointer ended.<p>When no pointer can be advanced, the pointer that gives longest word is the best.
bladeaodalmost 16 years ago
So by the nature of the app, the string I am checking will often increase by 1 letter, and then I check the string again. Right now I save a list of strings I have checked, because it is quicker to see if a string in in that small list than in the dictionary, then once I find a word I clear the list of strings I have checked. Is there any thing else like this I could do to speed things up so I am not constantly checking if the same strings are a word? Should I not be clearing that list so often?
评论 #660412 未加载
mbrubeckalmost 16 years ago
Just to be clear, is this a valid example of the output you're looking for?<p><pre><code> longest_word("oteuhbicyclentehut") =&#62; "bicycle"</code></pre>
评论 #660402 未加载