For an app I wrote I need to check if a String contains a word, finding the longest word it contains.
Currently I check the whole string against a dictionary, then check every substring possible with decreasing length until I find a word.<p>Is there a more efficient way to do this?
EDIT: Revised running time and added code (in Ruby)<p>Build a Trie out of your dictionary. Then plug your string into and record depth/chars where it deadends. Repeat for you string with the first char removed. At the end you should have the longest substring.<p>Minor optimization would be to stop plugging things into the Trie once the remaining characters in the string are less than the longest word you have already found.<p>This should run in linear time (relative to your search string) because the recursion is limited by the longest word in the dictionary (depth of the trie). I have tested this out just for fun.<p>First I grabbed the Wikipedia page for 'line of succession to the British throne' and stripped it down to text only (352,000 chars). Then I added 'monarch' to my dictionary. Then I took the first 10k chars of the text and put it into my trie, getting back monarch as my match and timing it. Then I figured the ratio of time taken to size of string (10k). I repeated this for 20k chars, 30k chars ... 350k chars, and the ratio remains constant, indicating time increases in proportion to string length... linear time.<p><pre><code> DICTIONARY = %w(a ants apple bear bicycle)
class LameTrie
def initialize(dict)
@trie = {}
dict.each {|word| insert(word) }
# obviously it'd be nice to only build this once
# and then write it out to disk, since our
# dictionary probably won't change much
end
def insert(word, level = @trie)
c, rest = h_t(word)
level[c] = {} unless level.has_key?(c)
if rest.empty?
level[c][:term] = true
else
insert(rest, level[c])
end
end
def longest_word(str)
result, n = '', str.length
n.times do |i|
t = longest?(str[i..-1])
result = t if t.size > result.size
break if result.size > n-i # our micro optimization
end
result
end
def longest?(str, level = @trie)
return '' if str.empty? or str.nil?
c, rest = h_t(str)
if level.has_key?(c)
x = longest?(rest, level[c])
if level[c].has_key?(:term) || !x.empty?
c + x
else
''
end
else
''
end
end
def h_t(str)
return str[0,1], str[1..-1]
end
end
t = LameTrie.new(DICTIONARY)
puts "Dictionary: #{DICTIONARY.inspect}"
puts '[emptystring] -> ' + t.longest_word('')
%w(a an antler crab crabapple pants polarbear adntsants oteuhbicyclentehut bearclaw).each do |str|
puts str + ' -> ' + t.longest_word(str)
end</code></pre>
I'll use radix tree for dictionary, to speed up testing.<p>For words (son, sun, so, sunny) it would be:<p><pre><code> root -> S -> O -> .
\ \-> N -> .
\-> U -> N -> .
\-> N -> Y -> .
</code></pre>
Then - for each character in given string I'll make pointer pointing at root of dictionary tree.<p>Then in loop I would advance down each pointer in turn, until it's no longer possible to go down the tree, or the until the characters for given pointer ended.<p>When no pointer can be advanced, the pointer that gives longest word is the best.
So by the nature of the app, the string I am checking will often increase by 1 letter, and then I check the string again. Right now I save a list of strings I have checked, because it is quicker to see if a string in in that small list than in the dictionary, then once I find a word I clear the list of strings I have checked. Is there any thing else like this I could do to speed things up so I am not constantly checking if the same strings are a word? Should I not be clearing that list so often?
Just to be clear, is this a valid example of the output you're looking for?<p><pre><code> longest_word("oteuhbicyclentehut") => "bicycle"</code></pre>