I figured I'd throw this out there. This explains the last remaining difficulty in my StructuredText parser.
The problem is this: if you have tokens that are made up of the same characters, the parser needs to do a better job of figuring out which one opens and which one closes. It has to DWIM better than it's doing. So, if bold is '**' and italic is '*', and you have a line like ***bold italic** italic* or ***bold italic* bold**, it's got to be able to parse both correctly.
My regular expressions in my current parser do that correctly, so when I get a chance to dive into it I should be able to make what worked for my regular expressions work for my new parser without regular expressions. How hard it's going to be to code depends on whether the regular expressions worked because of backtracking.
I think part of the solution is going to be to, for a given open or close tag (such as '*'), look at what other tags share that character, and make sure that what's in the text can only be one tag. That is, to make sure it's not ambiguous. This involves precedence rules to disambiguate, such as "short tags match first", so that '*' will match before '**', and it shouldn't involve searching around more than a few characters. Hopefully, it will only involve code when matching an opening tag, and hopefully it'll only involve looking forward, and not back. We'll see.
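To make that concrete, here's a minimal sketch in Python of one way the forward-looking idea could work. It's not the actual parser: the names TAGS, split_opening_run and tokenize are made up for illustration, and it assumes bold is '**', italic is '*', and that a run of asterisks opens only at the start of the line or after whitespace. It also cheats a little on "a few characters", since it scans forward to the next run of asterisks to see which tag closes first, and it only falls back on "short tags match first" when that doesn't settle it.

```python
import re

# Hypothetical tag table: bold and italic share the '*' character.
TAGS = {"*": "italic", "**": "bold"}
RUN = re.compile(r"\*+")


def split_opening_run(text, end, length):
    """Split an opening run of '*' into delimiters, outermost first.

    Only a run of three is ambiguous. To resolve it, look forward to the
    next run of '*': whichever tag closes first must be the innermost one.
    """
    if length in (1, 2):
        return ["*" * length]
    if length != 3:
        raise ValueError("this sketch only handles runs of 1 to 3 asterisks")
    nxt = RUN.search(text, end)
    if nxt and len(nxt.group()) == 1:
        return ["**", "*"]  # italic closes first, so bold is outermost
    return ["*", "**"]      # fallback precedence: short tags match first


def tokenize(text):
    """Return a list of ('open' | 'close' | 'text', value) events."""
    events, stack, pos = [], [], 0
    for m in RUN.finditer(text):
        if m.start() > pos:
            events.append(("text", text[pos:m.start()]))
        pos = m.end()
        # Assumption: a run opens only at line start or after whitespace.
        at_word_start = m.start() == 0 or text[m.start() - 1].isspace()
        if at_word_start:
            for delim in split_opening_run(text, m.end(), len(m.group())):
                stack.append(delim)
                events.append(("open", TAGS[delim]))
        else:
            # Closing run: pop innermost tags while they fit in the run.
            # Leftover asterisks that match nothing are silently dropped.
            remaining = len(m.group())
            while remaining and stack and len(stack[-1]) <= remaining:
                delim = stack.pop()
                remaining -= len(delim)
                events.append(("close", TAGS[delim]))
    if pos < len(text):
        events.append(("text", text[pos:]))
    return events


for line in ("***bold italic** italic*", "***bold italic* bold**"):
    print(tokenize(line))
```

With the first line the opening run comes out as italic then bold, and with the second as bold then italic, so both nest properly. The only place that looks anywhere but the current run is split_opening_run, and it only looks forward.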
By the way, I've added a StructuredText page on my wiki to keep track of different StructuredText implementations. Also keep in mind that when I refer to StructuredText, that's not one set thing. I use "StructuredText" to refer to any plain text format that uses conventions to give structure to the markup without having to use a formal markup language like XML or LaTeX, etc.