This is a great article. It's pretty fun to play around with this heuristic:<p><pre><code> lambda c: 3*len(matches(c, uncovered)) - len(c)
</code></pre>
Here's a trivial way to explore it: say we generalize the heuristic to H(a, b).<p><pre><code> H(a,b) = lambda c: a*len(matches(c, uncovered)) - b*len(c)
</code></pre>
The original heuristic is considered H(3,1) by this definition. Then we can play around with a and b to see if we'd get smaller results.<p><pre><code> def findregex_lambda(winners, losers, a, b):
"Find a regex that matches all winners but no losers (sets of strings)."
# Make a pool of candidate components, then pick from them to cover winners.
# On each iteration, add the best component to 'cover'; finally disjoin them together.
pool = candidate_components(winners, losers)
cover = []
while winners:
best = max(pool, key=lambda c: a*len(matches(c, winners)) - b*len(c))
cover.append(best)
pool.remove(best)
winners = winners - matches(best, winners)
return '|'.join(cover)
>>> findregex_lambda(starwars, startrek, 3, 1)
' T|E.P| N'
>>> findregex_lambda(starwars, startrek, 3, 2)
' T|B| N| M'
</code></pre>
Or, to automate this:<p><pre><code> def best_H_heuristic(winners, losers):
d = {(a,b) : len(findregex_lambda(winners, losers, a,b)) for a in range(0,4) for b in range(0,4)}
return min(d, key=d.get)
>>> best_H_heuristic(starwars, startrek):
(3,1)
</code></pre>
Looks like H(3,1) is pretty good for this case. What about the nfl teams?<p><pre><code> >>> best_H_heuristic(nfl_in, nfl_out)
(3, 2)
>>> findregex_lambda(nfl_in, nfl_out, 3, 1)
'pa|g..s|4|fs|sa|se|lt|os'
>>> findregex_lambda(nfl_in, nfl_out, 3, 2)
'pa|ch|4|e.g|sa|se|lt|os'
</code></pre>
Not the best heuristic there. H(3,1) wins or ties for the boys/girls set, left/right set and drugs/cities set, which just goes to show you that picking a heuristic off a gut guess isn't such a bad approach.<p>You could also explore heuristics of different forms:<p><pre><code> M(a,b,d,e) = lambda c: a*len(matches(c, uncovered))^b - d*len(c)^e
</code></pre>
Or trying completely different formats:<p><pre><code> L(a,b) = lambda c: a*log(len(matches(c, uncovered))) - b*len(c)</code></pre>