TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Finite state machines as data structures (2015)

8 pointsby heydenberkover 6 years ago

2 comments

unhammerover 6 years ago
(with proper formatting this time)<p>Note that it is perfectly possible to have more complicated values associated with FST keys, just not in the <a href="https:&#x2F;&#x2F;docs.rs&#x2F;fst&#x2F;0.3.3&#x2F;fst&#x2F;map&#x2F;index.html" rel="nofollow">https:&#x2F;&#x2F;docs.rs&#x2F;fst&#x2F;0.3.3&#x2F;fst&#x2F;map&#x2F;index.html</a> implementation. FST&#x27;s <i>can also be cyclic</i> – this lets you represent things you couldn&#x27;t with just a hash table.<p>-----<p>Anyone who wants to play around with this should try the HFST[1] library, which lets you create compact and possibly cyclic string-to-string maps, which are closed under union, intersection, reversal, inversion, difference, composition. HFST makes it quite easy to do different operations on FST&#x27;s:<p><pre><code> $ echo &#x27;c a t 0:%+N 0:%+Sg&#x27; | hfst-regexp2fst &gt; cat.fst $ echo cat |hfst-lookup -q cat.fst cat cat+N+Sg 0,000000 </code></pre> (the default is for each arc to have input-equals-output, but you can use : to map inputs to outputs, and use % to escape special characters; 0 is epsilon&#x2F;&quot;nothing&quot;)<p><pre><code> $ echo cats |hfst-lookup -q cat.fst cats cats+? inf </code></pre> only singular, so let&#x27;s make one for the plural:<p><pre><code> $ echo &#x27;c a t 0:%+N s:%+Pl&#x27; | hfst-regexp2fst &gt; cats.fst $ echo cats |hfst-lookup -q cats.fst cats cat+N+Pl 0,000000 </code></pre> and combine them:<p><pre><code> $ hfst-union -1 cat.fst -2 cats.fst &gt;feline.fst $ hfst-fst2strings feline.fst cat:cat+N+Sg cats:cat+N+Pl </code></pre> and make it go from analysis to form:<p><pre><code> $ hfst-invert feline.fst &gt; ɟǝlᴉuǝ.fst $ echo &#x27;cat+N+Pl&#x27; | hfst-lookup -q ɟǝlᴉuǝ.fst cat+N+Pl cats 0,000000 $ hfst-fst2strings ɟǝlᴉuǝ.fst cat+N+Sg:cat cat+N+Pl:cats </code></pre> this is what the states look like:<p><pre><code> $ hfst-fst2txt feline.fst 0 1 c c 0.000000 0 6 @0@ @0@ 0.000000 1 2 a a 0.000000 2 3 t t 0.000000 3 4 @0@ +N 0.000000 4 5 @0@ +Sg 0.000000 5 0.000000 6 7 c c 0.000000 7 8 a a 0.000000 8 9 t t 0.000000 9 10 @0@ +N 0.000000 10 11 s +Pl 0.000000 11 0.000000 </code></pre> there&#x27;s some redundancy, so we should minimize it:<p><pre><code> $ hfst-minimize feline.fst &gt;min.fst $ hfst-fst2txt min.fst 0 1 c c 0.000000 1 2 a a 0.000000 2 3 t t 0.000000 3 4 @0@ +N 0.000000 4 5 @0@ +Sg 0.000000 4 5 s +Pl 0.000000 5 0.000000 </code></pre> Now let&#x27;s make an FST that turns slashes into +-es and increases the weight for every slash we see (~$[ a ] means anything-but-a):<p><pre><code> $ echo &#x27;%&#x2F;:%+ ~$[ %&#x2F; ]&#x27; | hfst-regexp2fst | hfst-reweight --end-states-only --addition=1 |hfst-fst2txt 0 1 &#x2F; + 0.000000 1 1 + + 0.000000 1 1 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.000000 1 1.000000 $ echo &#x27;%&#x2F;:%+ ~$[ %&#x2F; ]&#x27; | hfst-regexp2fst | hfst-reweight --end-states-only --addition=1 | hfst-repeat &gt; dir.fst $ echo &#x2F;a | hfst-lookup -q dir.fst &#x2F;a +a 1,000000 $ echo &#x2F;ab&#x2F;c | hfst-lookup -q dir.fst &#x2F;ab&#x2F;c +ab+c 2,000000 $ echo &#x2F;ab&#x2F;c&#x2F;d&#x2F;&#x2F;e | hfst-lookup -q dir.fst &#x2F;ab&#x2F;c&#x2F;d&#x2F;&#x2F;e +ab+c+d++e 5,000000 </code></pre> – that&#x27;s not something you can do with (just) a hash table.<p>-----<p>On Debians you can install the package `giella-sme` which gives you a 37M cyclic FST of 587060 states and 1101943 arcs which turns North Sámi word forms into analyses. North Sámi has productive compounding, so e.g. &quot;school bus coffee&quot; is a word that I suppose someone might say:<p><pre><code> $ echo &#x27;skuvlabussegáfe&#x27; | hfst-lookup -q &#x2F;usr&#x2F;share&#x2F;giella&#x2F;sme&#x2F;analyser-disamb-gt-desc.hfstol |head -1 skuvlabussegáfe skuvlabusse+G3+Sem&#x2F;Veh+N+Err&#x2F;Orth+SgNomCmp+Cmp#gáfe+Sem&#x2F;Plant+N+Sg+Nom 10,000000 </code></pre> and there&#x27;s a bit of ambiguity in the analysis:<p><pre><code> $ echo &#x27;skuvlabussegáfe&#x27; | hfst-lookup -q &#x2F;usr&#x2F;share&#x2F;giella&#x2F;sme&#x2F;analyser-disamb-gt-desc.hfstol |wc -l 64 </code></pre> [1] <a href="https:&#x2F;&#x2F;github.com&#x2F;hfst&#x2F;hfst&#x2F;" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;hfst&#x2F;hfst&#x2F;</a> , using among others <a href="http:&#x2F;&#x2F;openfst.org&#x2F;twiki&#x2F;bin&#x2F;view&#x2F;FST&#x2F;WebHome" rel="nofollow">http:&#x2F;&#x2F;openfst.org&#x2F;twiki&#x2F;bin&#x2F;view&#x2F;FST&#x2F;WebHome</a> under the hood. `sudo apt install hfst` on Debians
评论 #18741876 未加载
unhammerover 6 years ago
Note that it is perfectly possible to have more complicated values associated with FST keys, just not in the <a href="https:&#x2F;&#x2F;docs.rs&#x2F;fst&#x2F;0.3.3&#x2F;fst&#x2F;map&#x2F;index.html" rel="nofollow">https:&#x2F;&#x2F;docs.rs&#x2F;fst&#x2F;0.3.3&#x2F;fst&#x2F;map&#x2F;index.html</a> implementation. FST&#x27;s can also be cyclic – this lets you represent things you couldn&#x27;t with just a hash table.<p>Anyone who wants to play around with this should try the HFST[1] library, which lets you create compact and possibly cyclic string-to-string maps, which are closed under union, intersection, reversal, inversion, difference, composition. HFST makes it quite easy to do different operations on FST&#x27;s:<p>$ echo &#x27;c a t 0:%+N 0:%+Sg&#x27; | hfst-regexp2fst &gt; cat.fst $ echo cat |hfst-lookup -q cat.fst cat cat+N+Sg 0,000000<p>(the default is for each arc to have input-equals-output, but you can use : to map inputs to outputs, and use % to escape special characters; 0 is epsilon&#x2F;&quot;nothing&quot;)<p>$ echo cats |hfst-lookup -q cat.fst cats cats+? inf<p>$ echo &#x27;c a t 0:%+N s:%+Pl&#x27; | hfst-regexp2fst &gt; cats.fst $ echo cats |hfst-lookup -q cats.fst cats cat+N+Pl 0,000000<p>$ hfst-union -1 cat.fst -2 cats.fst &gt;feline.fst $ hfst-fst2strings feline.fst cat:cat+N+Sg cats:cat+N+Pl<p>$ hfst-invert feline.fst &gt; ɟǝlᴉuǝ.fst $ echo &#x27;cat+N+Pl&#x27; | hfst-lookup -q ɟǝlᴉuǝ.fst cat+N+Pl cats 0,000000 $ hfst-fst2strings ɟǝlᴉuǝ.fst cat+N+Sg:cat cat+N+Pl:cats<p>$ hfst-fst2txt feline.fst 0 1 c c 0.000000 0 6 @0@ @0@ 0.000000 1 2 a a 0.000000 2 3 t t 0.000000 3 4 @0@ +N 0.000000 4 5 @0@ +Sg 0.000000 5 0.000000 6 7 c c 0.000000 7 8 a a 0.000000 8 9 t t 0.000000 9 10 @0@ +N 0.000000 10 11 s +Pl 0.000000 11 0.000000<p>$ hfst-minimize feline.fst &gt;min.fst $ hfst-fst2txt min.fst 0 1 c c 0.000000 1 2 a a 0.000000 2 3 t t 0.000000 3 4 @0@ +N 0.000000 4 5 @0@ +Sg 0.000000 4 5 s +Pl 0.000000 5 0.000000<p>Now let&#x27;s make an FST that turns slashes into +-es and increases the weight for every slash we see (~$[ a ] means anything-but-a):<p>$ echo &#x27;%&#x2F;:%+ ~$[ %&#x2F; ]&#x27; | hfst-regexp2fst | hfst-reweight --end-states-only --addition=1 |hfst-fst2txt 0 1 &#x2F; + 0.000000 1 1 + + 0.000000 1 1 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.000000 1 1.000000<p>$ echo &#x27;%&#x2F;:%+ ~$[ %&#x2F; ]&#x27; | hfst-regexp2fst | hfst-reweight --end-states-only --addition=1 | hfst-repeat &gt; dir.fst $ echo &#x2F;a | hfst-lookup -q dir.fst &#x2F;a +a 1,000000<p>$ echo &#x2F;ab&#x2F;c | hfst-lookup -q dir.fst &#x2F;ab&#x2F;c +ab+c 2,000000<p>$ echo &#x2F;ab&#x2F;c&#x2F;d&#x2F;&#x2F;e | hfst-lookup -q dir.fst &#x2F;ab&#x2F;c&#x2F;d&#x2F;&#x2F;e +ab+c+d++e 5,000000<p>On Debians you can install the package `giella-sme` which gives you a 37M cyclic FST of 587060 states and 1101943 arcs which turns North Sámi word forms into analyses. North Sámi has productive compounding, so e.g. &quot;school bus coffee&quot; is a word that I suppose someone might say:<p>$ echo &#x27;skuvlabussegáfe&#x27; | hfst-lookup -q &#x2F;usr&#x2F;share&#x2F;giella&#x2F;sme&#x2F;analyser-disamb-gt-desc.hfstol |head -1 skuvlabussegáfe skuvlabusse+G3+Sem&#x2F;Veh+N+Err&#x2F;Orth+SgNomCmp+Cmp#gáfe+Sem&#x2F;Plant+N+Sg+Nom 10,000000<p>and there&#x27;s a bit of ambiguity in the analysis:<p>$ echo &#x27;skuvlabussegáfe&#x27; | hfst-lookup -q &#x2F;usr&#x2F;share&#x2F;giella&#x2F;sme&#x2F;analyser-disamb-gt-desc.hfstol |wc -l 64<p>[1] <a href="https:&#x2F;&#x2F;github.com&#x2F;hfst&#x2F;hfst&#x2F;" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;hfst&#x2F;hfst&#x2F;</a> , using among others <a href="http:&#x2F;&#x2F;openfst.org&#x2F;twiki&#x2F;bin&#x2F;view&#x2F;FST&#x2F;WebHome" rel="nofollow">http:&#x2F;&#x2F;openfst.org&#x2F;twiki&#x2F;bin&#x2F;view&#x2F;FST&#x2F;WebHome</a> under the hood. `sudo apt install hfst` on Debians