Keith Devens .com

Wednesday, December 8, 2004

Regular expression question

It's pretty trivial to split a key:value pair with a regular expression. Something like /^\s*([^:]+)\s*:\s*(.*?)\s*$/ can do the job. But how can you do it if you want to allow escaped colons in your key?

Say you allow the escapes "\\" → '\' and "\:" → ':' in your key. Is there any way to use a regular expression to split the key and value while ignoring a "\:"? Keep in mind that the input could actually be "\\:", in which case you should still split on that colon, because it is the backslash that is escaped. So a simple one-character lookbehind isn't sufficient.

Any solutions?

Comments

Hans (http://zephyrfalcon.org/) wrote:

I'm not sure a regular expression is necessary here. If there is a character that is certain not to be used in your key/value string (e.g. 0xff), then you can use that. Replace the escaped colon "\:" with the special character, then split on the real colon, then replace the special character in the parts with a colon again.

>>> s = "foo\:bar:baz\:xyzzy"
>>> t = s.replace("\:", "\xff")
>>> t
'foo\xffbar:baz\xffxyzzy'
>>> parts = t.split(":")
>>> parts
['foo\xffbar', 'baz\xffxyzzy']
>>> key = parts[0].replace("\xff", ":")
>>> value = parts[1].replace("\xff", ":")
>>> key, value
('foo:bar', 'baz:xyzzy')

Just my $0.02...

∴ Hans | 8-Dec-2004 9:22pm est | http://zephyrfalcon.org/ | #6543

Hans (http://zephyrfalcon.org/) wrote:

That should really have been "\\:", of course; it's lucky that "\:" happens to work as well.

∴ Hans | 8-Dec-2004 9:23pm est | http://zephyrfalcon.org/ | #6544

Keith (http://keithdevens.com/) wrote:

Replace the escaped colon "\:" with the special character, then split on the real colon...

But you still have the question of how to identify the escaped colon. For instance, you can have "key\::value", "key\\::value", "key\\\::value", or "key\\\\::value". The first colon is "real" only in the second and fourth examples, but looks like "\:" in all of them.
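
The parity rule behind those four examples can be sketched in a few lines of Python. This is only an illustration, not part of the original discussion: the helper name is_real_colon is hypothetical, and it assumes that backslash is the only escape character.

```python
def is_real_colon(s: str, i: int) -> bool:
    """Return True if s[i] is a colon preceded by an even number of backslashes."""
    if s[i] != ":":
        return False
    backslashes = 0
    j = i - 1
    # Count the unbroken run of backslashes immediately before the colon.
    while j >= 0 and s[j] == "\\":
        backslashes += 1
        j -= 1
    # An even run means every backslash is paired up, so the colon is unescaped.
    return backslashes % 2 == 0


examples = [r"key\::value", r"key\\::value", r"key\\\::value", r"key\\\\::value"]
results = [is_real_colon(s, s.index(":")) for s in examples]
print(results)  # [False, True, False, True]
```

A fixed-width lookbehind cannot express "an even number of backslashes", which is why the one-character version falls short.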

Keith | 8-Dec-2004 10:34pm est | http://keithdevens.com/ | #6545

Leland Johnson (http://protoplasmic.org/) wrote:

Well, in perl you can use the fancy "zero-width negative look-behind assertion" and "zero-width positive look-behind assertion" thingies.

Try this:
perl -e'print join("\n", split(qr/(?<=\\\\)Smiley indifferent(?<!\\):/, <STDIN>))'
with this input:
asdfsdf:here be colon \: haha!:ending backslash \\:real end
Should result in:
asdfsdf
here be colon \: haha!
ending backslash \\
real end

Here's it explained:

(?<=\\\\):
Match "\\:" without including backslash backslash in the match (so the split doesn't eat it up).

or
(?<!\\):
Match a ":" that does not have a backslash preceding it.
(Note that backslashes were escaped in the regex, but not the sample strings.)

So you'd still have to replace "\:" with ":" and "\\" with "\" afterwards anyway.

The "zero-width negative look-behind assertion", "(?<!pattern)", and "zero-width positive look-behind assertion", "(?<=pattern)", expressions are quite arcane/obscure though.

I'll readily admit that I had to pull up perlre to remember this, and I got it wrong a few times too.

I would very much advise against using this in production code. It doesn't handle "\\\\\:" and similar cases properly. A regex doesn't beat a state machine, though. I think it wins on speed and readability, but I wouldn't know. Thanks for the fun problem, though!
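
For comparison, a state-machine style scanner might look like the Python sketch below. It is a rough illustration under stated assumptions, not code from this thread: the function name split_key_value is hypothetical, only "\\" and "\:" are treated as escapes, and a backslash in front of any other character is kept literally.

```python
def split_key_value(line: str):
    """Split on the first unescaped colon; return (key, value) or None."""
    key_chars = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] in "\\:":
            # Escape sequence: keep the escaped character, drop the backslash.
            key_chars.append(line[i + 1])
            i += 2
        elif ch == ":":
            # First unescaped colon: everything after it is the value.
            return "".join(key_chars).strip(), line[i + 1:].strip()
        else:
            # Ordinary character (assumption: a lone backslash stays literal).
            key_chars.append(ch)
            i += 1
    return None  # no unescaped colon, so this is not a key:value line


print(split_key_value(r"key\::value"))  # ('key:', 'value')
print(split_key_value(r"key\:value"))   # None
```

The scanner unescapes the key while it splits, so no second pass is needed.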

Text::Balanced might be able to do what you want.

And your blogging software seems to be actually using something like my solution - I can't get "n backslashes and a double quote" (where n >= 3) to display properly.

∴ Leland Johnson | 9-Dec-2004 2:39am est | http://protoplasmic.org/ | #6546

Leland Johnson (http://protoplasmic.org/) wrote:

And your blogging software stole the colon pipe in the middle of the expression and replaced it with an "indifferent" smiley. :|
:)
Posting from lynx is not fun.

∴ Leland Johnson | 9-Dec-2004 2:43am est | http://protoplasmic.org/ | #6547

Keith (http://keithdevens.com/) wrote:

If you'd used a code block it wouldn't have happened:

perl -e'print join("\n", split(qr/(?<=\\\\):|(?<!\\):/, <STDIN>))'
Keith | 9-Dec-2004 2:49am est | http://keithdevens.com/ | #6548

Keith (http://keithdevens.com/) wrote:

I think what I want to do is provably impossible. Only, I'm not sure how to prove it. I was hoping someone more clever than I could think of a way.

Keith | 9-Dec-2004 3:15am est | http://keithdevens.com/ | #6549

Jonas wrote:

import re
print(re.split(r'(?<!\\):', 'te\\:st\\1:va\\lue1'))
∴ Jonas | 9-Dec-2004 9:03am est | #6551

129.42.208.182 wrote:

Instead of saying "one or more non-colon characters", you say "one or more of either a backslash followed by anything or a non-colon character".


/^\s*((\\.|[^:])+)\s*:\s*(.*?)\s*$/
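
As a rough illustration, the same pattern can be tried in Python on well-formed input. The names PAIR and parse are hypothetical, and the unescaping step assumes that only "\\" and "\:" are escapes. Note that the key is group 1 and the value is group 3, because the inner alternation is a capturing group.

```python
import re

# The pattern above, unchanged.
PAIR = re.compile(r'^\s*((\\.|[^:])+)\s*:\s*(.*?)\s*$')


def parse(line):
    """Return (key, value) with the key unescaped, or None if nothing matches."""
    m = PAIR.match(line)
    if m is None:
        return None
    raw_key, value = m.group(1), m.group(3)
    # Turn "\\" into "\" and "\:" into ":" in a single left-to-right pass.
    key = re.sub(r'\\([\\:])', r'\1', raw_key)
    return key, value


print(parse(r'foo\:bar: baz'))  # ('foo:bar', 'baz')
```

This only exercises lines that do contain an unescaped colon.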

∴ 129.42.208.182 | 9-Dec-2004 12:52pm est | #6552

Keith (http://keithdevens.com/) wrote:

>>> re.split(r'(?<!\\):', 'te\\:st\\1:va\\lue1')
['te\\:st\\1', 'va\\lue1']
Keith | 9-Dec-2004 12:54pm est | http://keithdevens.com/ | #6553

Keith (http://keithdevens.com/) wrote:

129.42.208.182:

Excellent. Only one problem:

$\ = "\n";
$_ = 'key\\\\\\\\\\:value';
/^\s*((?:\\.|[^:])+?)\s*:\s*(.*?)\s*$/;
print;
print $1;
print $2;

(regex modified slightly to make the inner group non-capturing and to make the capture non-greedy)

Prints:

key\\\\\:value
key\\\\\
value

So, any way you can think of to get it to not split in a case like this when it shouldn't?

Note that if there were another colon available to split on:

key\\\\\:value:value2
key\\\\\:value
value2

It correctly waits until it finds a match later on in the string.

Keith | 9-Dec-2004 1:30pm est | http://keithdevens.com/ | #6556

Keith (http://keithdevens.com/) wrote:

Hey, I think this did it:

$\ = "\n";
$_ = 'key\\\\\\\\\\:value';
/^\s*((?:\\.|[^:])+)\s*(:?)\s*(.*?)\s*$/;
print;
print $1;
print $2;
print $3;

Prints:

key\\\\\:value
key\\\\\:value


And if you add a backslash to the input above, it prints:

key\\\\\\:value
key\\\\\\
:
value

So, as long as $2 is set to a colon you know it got through the key to the value, and not just that there was a blank value.

Alternatively:

use warnings;
use strict;
$\ = "\n";
$_ = 'key\\\\\\\\\\:value';
/^\s*((?:\\.|[^:])+)\s*:?\s*(.*?)??\s*$/;
print;
print $1;
print $2;

Gives:

key\\\\\:value
key\\\\\:value
Use of uninitialized value in print at test.pl line 8.

Now I just wonder if there's any way to make the regex fail altogether on an invalid key:value.
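
One possible approach, sketched here in Python rather than Perl, is to exclude the backslash from the character class so that a backslash can only ever be consumed as the first half of an escape pair. This is an assumption-laden sketch rather than a settled answer from this thread: the name STRICT is hypothetical, and the pattern treats a backslash followed by any character as an escape.

```python
import re

# [^:\\] cannot match a backslash, so the engine has no way to backtrack
# into reading an escaping backslash as an ordinary key character.
STRICT = re.compile(r'^\s*((?:\\.|[^:\\])+)\s*:\s*(.*?)\s*$')

odd = "key" + "\\" * 5 + ":value"   # five backslashes: the colon is escaped
even = "key" + "\\" * 6 + ":value"  # six backslashes: the colon is real

print(STRICT.match(odd) is None)    # True
print(STRICT.match(even).group(2))  # value
```

With the colon required and the backslash excluded from the class, a line with no unescaped colon fails to match instead of splitting in the wrong place.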

Keith | 9-Dec-2004 1:55pm est | http://keithdevens.com/ | #6557
