(What the article calls) Canonisation is an interesting issue - it seems very hard to get right and to do so efficiently.<p>I'm trying to wrap my head around a the issue in a regex parser that I've knocked up.<p>I'm currently ending up (expectedly) with multiple nodes representing equivalent languages; I want to strip these out when I use code generation to convert the constructed network into a switch based static automaton.<p>The most robust way to see whether two languages are the same is to xor them and test for interminability - but this means comparing each pair of nodes throughout the network, and I'd rather avoid the n^2 scaling if there's another option.<p>That option is to generate a canonical expression for the language that the machine represents, somewhat difficult but far more efficient when it comes to detecting collisions.<p><a href="https://en.wikipedia.org/wiki/Canonization" rel="nofollow">https://en.wikipedia.org/wiki/Canonization</a><p><a href="https://en.wikipedia.org/wiki/Canonicalization" rel="nofollow">https://en.wikipedia.org/wiki/Canonicalization</a>