Keith Devens .com |
Thursday, March 18, 2010 | ![]() |
| Mistakes are as serious as the results they cause. – House (The Mistake – ep 30) | ||
|
| ← Power Line: Two thoughts on Judge Alito | MozillaZine: Thunderbird 1.5 Released → |

David wrote:
David wrote:
If you are interested in viewing the lisp code for regexp-opt, you can find it here
Keith (http://keithdevens.com/) wrote:
Thanks David! I had actually been thinking of updating my "I'd love to see a solution to this in K" to include "or Lisp" at the end, but you beat me to it. Though that Elisp code is more than a little hairy. Also, I noticed from your regex that I'd forgotten to include "three" in my regex, so that's been corrected.
mtve wrote:
For perl see http://search.cpan.org/~dankogai/Regexp-Optimizer/ or http://search.cpan.org/~jhi/Regex-PreSuf/
Keith (http://keithdevens.com/) wrote:
Ok, finally got around to finishing this. Here's the code (just 41 lines including the harness):
def regexify_groups(groups):
return "|".join(regexify_terms(terms) for terms in groups)
def regexify_terms(terms):
#'terms' must not be empty, and all must begin with the same letter
if len(terms) == 1:
return terms[0]
sublist = [term[1:] for term in terms if len(term) > 1]
grouped_terms = [group for group in group_by_first_char(sublist)]
regex = regexify_groups(grouped_terms)
if len(grouped_terms) > 1: #if you have multiple groups, group them
regex = "("+regex+")"
if len(sublist) < len(terms): #if a term got eliminated
if len(regex) > 1: #if multiple letters,
regex = "("+regex+")" #need parens
regex += '?' #optional
return terms[0][0]+regex
def group_by_first_char(lst):
#must not be called with empty strings or an empty list
prev = lst[0][0] #get starting first character
temp = []
for item in lst:
if item[0] != prev:
yield temp
prev = item[0]
temp = []
temp.append(item)
if temp: yield temp
if __name__ == "__main__":
import sys
if len(sys.argv) < 2:
print "Usage: "+sys.argv[0]+" term1[ term2[ term3...]"
else:
print regexify_groups(group_by_first_char(sorted(set(sys.argv[1:]))))
There are surely more efficient ways of doing this, but I think this is pretty elegant. Though, I'd totally use Jarkko's code over this since it does more (thanks mtve).
Feel free to post a comment below. Please see my comment policy.
Formatting Rules (No HTML):
Generated in about 0.14s.
(Used 8 db queries)
The Emacs Lisp function `regexp-opt' does this. So:
(regexp-opt '("one" "two" "three" "four" "five" "twenty" "tween"))
yeilds:
"f\\(?:ive\\|our\\)\\|one\\|t\\(?:hree\\|w\\(?:e\\(?:en\\|nty\\)\\|o\\)\\)"
Yeah, Elisp suffers from backslash-itis.