KBD

Keith Devens .com

Saturday, November 22, 2008 Flag waving
Mistakes were made. – Ronald Reagan
← Power Line: Two thoughts on Judge AlitoMozillaZine: Thunderbird 1.5 Released →

Daily link icon Wednesday, January 11, 2006

Short programming problem

Here's a short programming problem: Given a list of tokens, generate a minimal regular expression that matches those (and only those) tokens. So, given ['one','two','three','four','five','twenty','tween'], the regular expression would be "f(ive|our)|one|t(hree|w(e(en|nty)|o))". I've shown it sorted, but it doesn't have to be. Also, don't worry about a '|'.

Now, an interesting problem would be how to generate a provably minimal regular expression... maybe there's already an algorithm I don't know about that does that. But all I'm trying to do here is prefix stemming.

I'll probably post some Python code later that solves the problem -- I'm just about finished writing it, but won't get to finish tonight. I'm most interested in your approaches. Who wants to try? Smiley

(I'd love to see a solution in K.)

← Power Line: Two thoughts on Judge AlitoMozillaZine: Thunderbird 1.5 Released →

Comments XML gif

David wrote:

The Emacs Lisp function `regexp-opt' does this. So:

(regexp-opt '("one" "two" "three" "four" "five" "twenty" "tween"))

yeilds:

"f\\(?:ive\\|our\\)\\|one\\|t\\(?:hree\\|w\\(?:e\\(?:en\\|nty\\)\\|o\\)\\)"

Yeah, Elisp suffers from backslash-itis. Smiley

∴ David | 11-Jan-2006 9:38pm est | #8978

David wrote:

If you are interested in viewing the lisp code for regexp-opt, you can find it here

∴ David | 11-Jan-2006 11:13pm est | #8979

Keith (http://keithdevens.com/) wrote:

Thanks David! I had actually been thinking of updating my "I'd love to see a solution to this in K" to include "or Lisp" at the end, but you beat me to it. Though that Elisp code is more than a little hairy. Also, I noticed from your regex that I'd forgotten to include "three" in my regex, so that's been corrected.

Keith | 11-Jan-2006 11:37pm est | http://keithdevens.com/ | #8980

mtve wrote:

∴ mtve | 12-Jan-2006 9:47am est | #8989

Keith (http://keithdevens.com/) wrote:

Ok, finally got around to finishing this. Here's the code (just 41 lines including the harness):

def regexify_groups(groups):
    return "|".join(regexify_terms(terms) for terms in groups)

def regexify_terms(terms):
    #'terms' must not be empty, and all must begin with the same letter
    if len(terms) == 1:
        return terms[0]

    sublist = [term[1:] for term in terms if len(term) > 1]
    grouped_terms = [group for group in group_by_first_char(sublist)]
    regex = regexify_groups(grouped_terms)

    if len(grouped_terms) > 1: #if you have multiple groups, group them
        regex = "("+regex+")"

    if len(sublist) < len(terms): #if a term got eliminated
        if len(regex) > 1: #if multiple letters,
            regex = "("+regex+")" #need parens
        regex += '?' #optional

    return terms[0][0]+regex

def group_by_first_char(lst):
    #must not be called with empty strings or an empty list
    prev = lst[0][0] #get starting first character
    temp = []
    for item in lst:
        if item[0] != prev:
            yield temp
            prev = item[0]
            temp = []
        temp.append(item)

    if temp: yield temp

if __name__ == "__main__":
    import sys
    if len(sys.argv) < 2:
        print "Usage: "+sys.argv[0]+" term1[ term2[ term3...]"
    else:
        print regexify_groups(group_by_first_char(sorted(set(sys.argv[1:]))))

There are surely more efficient ways of doing this, but I think this is pretty elegant. Though, I'd totally use Jarkko's code over this since it does more (thanks mtve).

Keith | 16-Jan-2006 3:29pm est | http://keithdevens.com/ | #9002

Feel free to post a comment below. Please see my comment policy.

Formatting Rules (No HTML):

  • **bold**, *italic*, _underlined_, --strikeout--
  • "text"="url" creates a link, and URLs are auto-highlighted
  • Blockquote: Like e-mail, begin paragraph with > (greater-than sign)
  • Lists: begin paragraph with *,-, or + (unordered), or # (ordered)
  • Code block: ?!code:language=perl|php|sql|javascript|etc.{\n}...{\n}?!/code

:
(will be your IP address if blank)
: (optional)
(Will not be shown on site)

: (optional)
:

November 2008
SunMonTueWedThuFriSat
 1
2345678
9101112131415
16171819202122
23242526272829
30 



RSS feed RSS feed for Keith's Weblog
Atom feed Atom feed for Keith's Weblog
Weblog archive
Recent comments
  on 5 posts

Recent comments XML

new⇒Ubuntu Nvidia install not working for me... could use a hand

Cant change xorg.conf!

I'm not​the owner of it, don't ask me​why
but it...

I)orogon: Nov 22, 5:41am

Calif. Supreme Court to take up gay marriage ban

I would argue the point is not​definitional.  While the word​marriage is su...

Justin: Nov 20, 4:37pm

Java join function

Meh, don't have null strings in​your string arrays imo, but you're​welcome ...

Keith: Nov 19, 7:51pm

Girls, please don't get breast implants

sorry but another thing i have to​make a comment on about you​men...the men...

happynow: Nov 17, 11:36pm

Books by Vincent Cheung

to all Cheung​fans:

read:

http://www.progin​osko.com/aquascum/cheung.h...

Zamir: Nov 16, 9:07am

Generated in about 0.208s.

(Used 8 db queries)

mobile phone