Last time in this two-parter, I laid down the basics of the RTF format I followed in pasting code from VS to WLW, and some of the helper classes I started off with. This time, we'll look at the parser and the various tricks I used to make sure that the translated HTML was valid and produced the correct look for the code in a web page.
The parser I implemented is pretty much a standard top-down parser. I started off with a method that parsed the RTF document by calling other parsing methods to parse the header and the document, and those parsing methods would in turn call others to parse different chunks of the RTF document. Eventually I'd get to the point where I'd be parsing tokens. Essentially, a top-down parser is like a matryoshka doll, a set of nested dolls. The result of each parsing step would be some HTML markup.
Let's see how that works. First up is the highest level parsing method which sets up the initial state. It is this method that will be called by the WLW plug-in.
    public static string ParseRtf(string rtfValue)
    {
        ParserState state = new ParserState(rtfValue);
        IParseResult result = ParseRtf(state);
        if (result.Failed)
            return "***FAILED***";

        string html = result.Value;
        // Close any color spans the RTF left open at the end of the content.
        if (state.InBackgroundSpan)
            html += "</span>";
        if (state.InFontColorSpan)
            html += "</span>";
        return html;
    }
ParseRtf() gets the RTF document as a string (if you look here, you'll see that's what the Clipboard object returns), creates a new parser state object with it, and then calls the controlling top-level ParseRtf() method. On return, if the parser failed, we just return a simple error message string. (It looks a little tacky perhaps, but I'm not particularly bothered: I'll see it immediately in WLW and will be able to fix it. Without fail, it'll be because I forgot to copy the code properly from VS, and so the "fix" will be to try again.) If the parse succeeded, I need to make sure that any background or foreground color spans are properly terminated. I'll get to the reason for this in a moment when I talk about parsing the document content.
Here's the top-level parse method:
    private static IParseResult ParseRtf(ParserState state)
    {
        // The document must open with a brace.
        if (state.Current != '{')
            return FailedParse.Get;
        state.Advance();

        IParseResult result = ParseHeader(state);
        if (result.Failed)
            return result;

        result = ParseDocument(state);
        if (result.Failed)
            return result;

        // ...and close with the matching brace.
        if (state.Current != '}')
            return FailedParse.Get;
        state.Advance();
        return result;
    }
When this method is called it expects to see the very first brace of an RTF document. If it doesn't, fail immediately. Otherwise, jump over the brace, and then parse the header. If that failed, return immediately. If it succeeded, parse the document content. If that, in turn, failed, return immediately. If it succeeded, I expect to see the final closing brace of the whole RTF document. If not, fail immediately. Otherwise, jump over the final brace and return.
You can see from this the general process for all the parsing methods: it's a sequence of calling some more specific parse method and, if it failed, to return immediately with a failure. If it succeeded, go on to the next more-specific parse method and do the same. Sometimes, you do a check for a particular character you expect to be present in the RTF stream and fail if it's not there. Having described the general process, I won't draw attention to it again, unless there's something specific I want to discuss.
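For anyone reading this part on its own, here is a minimal sketch of the kind of helper types the parsing methods lean on. The names IParseResult, SuccessfulParse and FailedParse.Get are the ones used in the code in this post, but the bodies below are my assumption of how they might look, not the classes from part one. MiniState, ParsePattern, Expect() and ParseBracedLetter() are purely hypothetical names I've made up to show the "check, fail fast, otherwise advance" pattern in isolation.

```csharp
using System;

// Assumed shape of the result types; the real ones from part one may differ.
public interface IParseResult
{
    bool Succeeded { get; }
    bool Failed { get; }
    string Value { get; }
}

public sealed class SuccessfulParse : IParseResult
{
    public SuccessfulParse(string value) { Value = value; }
    public bool Succeeded { get { return true; } }
    public bool Failed { get { return false; } }
    public string Value { get; private set; }
}

public sealed class FailedParse : IParseResult
{
    private static readonly FailedParse instance = new FailedParse();
    private FailedParse() { }
    public static IParseResult Get { get { return instance; } }
    public bool Succeeded { get { return false; } }
    public bool Failed { get { return true; } }
    public string Value { get { return String.Empty; } }
}

// Hypothetical stand-in for ParserState: a cursor over the input string.
public sealed class MiniState
{
    private readonly string text;
    private int position;
    public MiniState(string text) { this.text = text; }
    // Returning '\0' past the end of the input is an assumption of this sketch.
    public char Current { get { return position < text.Length ? text[position] : '\0'; } }
    public void Advance() { position++; }
}

public static class ParsePattern
{
    // Expect one specific character: fail if it's absent, otherwise skip it.
    public static IParseResult Expect(MiniState state, char expected)
    {
        if (state.Current != expected) return FailedParse.Get;
        state.Advance();
        return new SuccessfulParse(expected.ToString());
    }

    // Chain smaller parse steps, returning at the first failure.
    public static IParseResult ParseBracedLetter(MiniState state)
    {
        IParseResult result = Expect(state, '{');
        if (result.Failed) return result;

        char letter = state.Current;
        if (!char.IsLetter(letter)) return FailedParse.Get;
        state.Advance();

        result = Expect(state, '}');
        if (result.Failed) return result;
        return new SuccessfulParse(letter.ToString());
    }
}
```

With these definitions, ParsePattern.ParseBracedLetter(new MiniState("{a}")) succeeds with the value "a", whereas the input "a}" fails on the very first check.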
Let's now parse the header block for an RTF document, at least the very simplified header VS gives us and bearing in mind that we don't really care about much of it, apart from the color table.
    private static IParseResult ParseHeader(ParserState state)
    {
        IParseResult result;

        // Leading keywords, up to the first group.
        do
        {
            result = ParseHeaderKeyword(state);
            if (result.Failed)
                return result;
        } while (state.Current != '{');

        // Header groups, up to the first keyword of the document content.
        do
        {
            result = ParseHeaderGroup(state);
            if (result.Failed)
                return result;
        } while (state.Current != '\\');

        return result;
    }
This method is the first where I really throw away a lot of stuff I'm not interested in. In essence, I make the assumption that there are a bunch of keywords (that is, alphanumeric identifiers preceded by backslashes), followed by a set of header groups (data enclosed in braces). We'll see that one of these header groups is the color table. The header groups are terminated by a backslash, which happens to be the first keyword of the document content. (Please check out the example RTF document here.)
I'll make the point again that I am deliberately simplifying the RTF header to suit my very specific need to convert code in RTF into HTML markup. The header has way more structure than this, but, again, I'm not interested. Don't take this code and apply it to a general word-processing document!
The ParseHeaderKeyword() method is very simple and makes use of a ParseKeyword() method to read the identifier:
    private static IParseResult ParseKeyword(ParserState state)
    {
        StringBuilder sb = new StringBuilder();
        do
        {
            sb.Append(state.Current);
            state.Advance();
        } while (char.IsLetterOrDigit(state.Current));

        // A single space may delimit the keyword; it is not part of the text.
        if (state.Current == ' ')
            state.Advance();
        return new SuccessfulParse(sb.ToString());
    }

    private static IParseResult ParseHeaderKeyword(ParserState state)
    {
        if (state.Current != '\\')
            return FailedParse.Get;
        state.Advance();
        return ParseKeyword(state);
    }
Essentially: a keyword starts with a backslash, has a bunch of alphanumeric characters, and might have a terminating space that we should skip. I do return the keyword (without the backslash) as part of a successful result.
The ParseHeaderGroup() method is next:
    private static IParseResult ParseHeaderGroup(ParserState state)
    {
        if (state.Current != '{')
            return FailedParse.Get;
        state.Advance();

        IParseResult result = ParseHeaderKeyword(state);
        if (result.Failed)
            return result;

        if (result.Value == "colortbl")
        {
            result = ParseColors(state);
        }
        else
        {
            result = ParseHeaderGroupData(state);
        }

        if (state.Current != '}')
            return FailedParse.Get;
        state.Advance();
        return result;
    }
The group starts with an opening brace and has a keyword that describes the group. If that keyword is colortbl, I need to parse out the colors and create a list to help with parsing the document content. Otherwise I just need to parse the remainder of the group. At the end I expect to see the closing brace, which I can skip.
Let's get parsing the group data out of the way (in essence, I ignore it all):
    private static IParseResult ParseHeaderGroupData(ParserState state)
    {
        StringBuilder sb = new StringBuilder();
        while (state.Current != '}')
        {
            if (state.Current == '{')
            {
                // A nested group: recurse so its braces stay balanced.
                IParseResult result = ParseHeaderGroup(state);
                if (result.Failed)
                    return result;
                sb.Append(result.Value);
            }
            else
            {
                sb.Append(state.Current);
                state.Advance();
            }
        }
        return new SuccessfulParse(sb.ToString());
    }
The only interesting thing about this is that a header group may have another header group embedded in it. (Take a look at the font table in the RTF document.) So I have to make sure I track the opening and closing braces.
The color table in the RTF document consists of a set of colors, each terminated with a semicolon.
    private static IParseResult ParseColors(ParserState state)
    {
        do
        {
            StringBuilder sb = new StringBuilder();
            while (state.Current != ';')
            {
                sb.Append(state.Current);
                state.Advance();
            }
            state.ColorTable.Add(sb.ToString());
            state.Advance();   // jump over the semicolon
        } while (state.Current != '}');
        return new SuccessfulParse("");
    }
The method parses each color out as a string in the form \red43\green145\blue175 and calls the ColorTable object's Add() method to add each of them. The delimiting semicolons are jumped over. The method finishes when it sees the final closing brace of the group.
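The ColorTable class itself comes from part one and isn't shown here. As a rough illustration of what has to happen to one of those color strings before it can be used in a style attribute, here is a small sketch. ColorEntrySketch and ToCssHex() are hypothetical names of mine, and returning null for an entry with no color components (such as the empty first entry of the table) is an assumption, not necessarily what ColorTable does.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ColorEntrySketch
{
    // Matches one color table entry such as \red43\green145\blue175.
    private static readonly Regex Entry =
        new Regex(@"\\red(\d+)\\green(\d+)\\blue(\d+)");

    // Converts an RTF color entry to a CSS hex color.
    // Returns null when the entry has no red/green/blue components
    // (an assumption of this sketch for the "default color" entry).
    public static string ToCssHex(string rtfColor)
    {
        Match match = Entry.Match(rtfColor);
        if (!match.Success)
            return null;

        int r = int.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
        int g = int.Parse(match.Groups[2].Value, CultureInfo.InvariantCulture);
        int b = int.Parse(match.Groups[3].Value, CultureInfo.InvariantCulture);
        return String.Format("#{0:x2}{1:x2}{2:x2}", r, g, b);
    }
}
```

For the entry quoted above, ColorEntrySketch.ToCssHex(@"\red43\green145\blue175") gives "#2b91af".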
We've now seen all the header parsing methods that we need for our particular application. In essence, we ignore everything except for the embedded color table, which we extract into a list of colors for our own purposes. Now the fun stuff: the document content, the actual colorized code from VS.
The essential process here is to read the content character by character, building an HTML encoded string as we go. We stop when we get to the closing brace of the entire RTF document. If the current character is not a backslash, we convert it to an HTML entity if needed (so & becomes &amp; for example), and append it to the HTML string. If the current character is a backslash, we may be seeing an escaped character or it may be a start of a keyword. If the former, we encode it if needed and add it to the HTML. If the latter, we need to process that keyword, whatever it may be.
    private static IParseResult ParseDocument(ParserState state)
    {
        StringBuilder sb = new StringBuilder();
        while (state.Current != '}')
        {
            if (state.Current == '\\')
            {
                IParseResult result = ParseDocEscapedChar(state);
                if (result.Succeeded)
                {
                    // An escaped character: emit it as text.
                    sb.Append(ConvertEntity(state.Current));
                    state.Advance();
                }
                else
                {
                    // Otherwise it's a keyword.
                    result = ParseDocKeyword(state);
                    if (result.Failed)
                        return result;
                    string s = ProcessDocKeyword(result.Value, state);
                    sb.Append(s);
                }
            }
            else
            {
                sb.Append(ConvertEntity(state.Current));
                state.Advance();
            }
        }
        return new SuccessfulParse(sb.ToString());
    }
This method has a slew of simple methods that need little to no explanation. First, escaping a character fails if the next letter is alphanumeric (since it would then be a keyword not an escaped character), succeeds otherwise. Notice that the method will jump over the backslash.
    private static IParseResult ParseDocEscapedChar(ParserState state)
    {
        state.Advance();   // jump over the backslash
        if (char.IsLetterOrDigit(state.Current))
            return FailedParse.Get;
        return new SuccessfulParse(state.Current.ToString());
    }
The conversion to an HTML entity is pretty simple and might need to be extended if you use other characters in your code that need converting to an entity.
    public static string ConvertEntity(char current)
    {
        switch (current)
        {
            case '&': return "&amp;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            default: return current.ToString();
        }
    }
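As an example of the kind of extension mentioned above, here is a sketch that adds the double quote to the list, together with a small string-level helper for trying it out. EntitySketch and Encode() are hypothetical names; whether you actually need &quot; depends on where the text ends up, since it only matters inside attribute values.

```csharp
using System.Text;

public static class EntitySketch
{
    // Same idea as ConvertEntity() above, with the double quote added.
    public static string ConvertEntity(char current)
    {
        switch (current)
        {
            case '&': return "&amp;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            case '"': return "&quot;";
            default: return current.ToString();
        }
    }

    // Hypothetical helper: encodes a whole string one character at a time.
    public static string Encode(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
            sb.Append(ConvertEntity(c));
        return sb.ToString();
    }
}
```

Calling EntitySketch.Encode("a < b && s == \"x\"") produces a &lt; b &amp;&amp; s == &quot;x&quot;.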
And now, finally we get to the really interesting method, the one that processes a keyword in the content.
Some background first. There are three keywords that we're going to process, all the others are ignored. The three are:
\cfN: set the font color to color N, where N is an index into the color table. \cf0 means "set the font to the default font color".
\cbN: set the background color to color N, where N is again an index into the color table. \cb0 means "set the background to the default background color".
\par: output a new line.
Seems simple enough, but there is a problem with the color keywords.
In essence what I'm going to do is to issue <span> tags around text that is of a different color. So to take an example from our RTF document, I'm going to convert the RTF
\cf1 namespace\cf0
into the HTML
<span style="color: #0000ff">namespace</span>
In other words, replace the "change color to N" keyword with an opening span tag, styled to the correct color, and the "revert to the default color" keyword with the relevant closing tag.
If all the RTF was like this, there would be no problem. However, check out the following code fragment:
return String.Format(
Its RTF version is this:
\cf1 return\cf0 \cf4 String\cf5 .\cf0 Format(
That first color change is the type I've already identified as simple. The second one is worse. In RTF it says: change the font color to 4, output "String", change the font color to 5, output ".", revert to the default font color. In other words, RTF doesn't surround text with begin color, end color pairs (which we'd like to have since that's how HTML works) but acts more like a stream: start this color, output some text, start this other color, output text, start another color, output text, etc, etc. There's essentially no "stop using this color" keyword, although we can use \cf0 for that.
So in the conversion code I have to track whether we're in a "font color span" and in a "background color span". If we are and we receive a color change keyword that is not the "revert to the default" keyword, we have to output a closing span tag before we open up another span tag.
Here's another code fragment:
return String.
which has the following RTF:
{...header stuff... \fs24 \cf1 return\cf0 \cf4 String\cf5 .}
Notice that there is no "revert to default color" keyword at the end of the content. We're left dangling in "font color 5" mode. No can do in HTML, so that's why in the very top of the parsing tree, I had those extra checks to output closing </span> tags if they were needed.
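To make the stream-to-spans idea concrete in isolation, here is a deliberately cut-down sketch that handles only \cfN keywords and nothing else: no entity encoding, no backgrounds, no other keywords. FontColorSpanSketch, Convert() and the cssColors parameter are hypothetical names, and the caller supplies the color table as ready-made CSS colors. Unlike the real method, this version only emits a closing tag when a span is actually open.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FontColorSpanSketch
{
    // A \cfN keyword plus the single optional space that delimits it.
    private static readonly Regex CfKeyword = new Regex(@"\\cf(\d+) ?");

    // cssColors[N] is the CSS color for \cfN; index 0 (the default) is unused.
    public static string Convert(string rtfContent, IList<string> cssColors)
    {
        bool inSpan = false;

        string html = CfKeyword.Replace(rtfContent, delegate(Match match)
        {
            int index = int.Parse(match.Groups[1].Value);

            // Any color change ends the span we're currently in, if any.
            string closing = inSpan ? "</span>" : String.Empty;
            if (index == 0)
            {
                inSpan = false;
                return closing;
            }

            inSpan = true;
            return closing + "<span style=\"color: " + cssColors[index] + "\">";
        });

        // The RTF may end without reverting to the default color.
        if (inSpan)
            html += "</span>";
        return html;
    }
}
```

Suppose the table is { "", "#0000ff", "#008000", "#a31515", "#2b91af", "#000000" }, where only the blue at index 1 is taken from the example earlier and the rest are made up for illustration. Given the input @"\cf1 return\cf0  \cf4 String\cf5 .", Convert() returns <span style="color: #0000ff">return</span> <span style="color: #2b91af">String</span><span style="color: #000000">.</span>. Note that I've assumed two spaces after \cf0 in the input: the first is the keyword delimiter, which is swallowed, and the second is the actual space in the code.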
Having provided the background, it should be easy to understand the rather complicated, somewhat unrefactored ProcessDocKeyword() method:
    private static string ProcessDocKeyword(string keyword, ParserState state)
    {
        int colorIndex;
        Color color;
        string format;

        // Split the keyword into its alphabetic base and optional numeric argument.
        Regex regex = new Regex(@"([a-z]+)(\d*)");
        Match match = regex.Match(keyword);
        string keywordBase = match.Groups[1].Value;

        switch (keywordBase)
        {
            case "cf":
                colorIndex = int.Parse(match.Groups[2].Value);
                if (colorIndex == 0)
                {
                    state.InFontColorSpan = false;
                    return "</span>";
                }
                color = state.ColorTable[colorIndex];
                format = "<span style=\"color: #{0:x2}{1:x2}{2:x2};\">";
                if (state.InFontColorSpan)
                    format = "</span>" + format;
                state.InFontColorSpan = true;
                return String.Format(format, color.R, color.G, color.B);

            case "cb":
                colorIndex = int.Parse(match.Groups[2].Value);
                if (colorIndex == 0)
                {
                    state.InBackgroundSpan = false;
                    return "</span>";
                }
                color = state.ColorTable[colorIndex];
                format = "<span style=\"background-color: #{0:x2}{1:x2}{2:x2};\">";
                if (state.InBackgroundSpan)
                    format = "</span>" + format;
                state.InBackgroundSpan = true;
                return String.Format(format, color.R, color.G, color.B);

            case "par":
                return Environment.NewLine;
        }
        return String.Empty;
    }
And that's it: converting an RTF document from a copy operation of some selected code in Visual Studio to HTML that we can then paste into Windows Live Writer through a plug-in.
There are bound to be some bugs in this, but it works for how I use the syntax highlighting colors in the VS editor. For example, if you're a fan of the darker, more restful color themes that have colored backgrounds, such as those here, it fails:
    namespace RtfToHtml
    {
        // The result of a parse operation
        public interface IParseResult
        {
            bool Succeeded { get; }
            bool Failed { get; }
            string Value { get; }
        }
    }
I can see from this that the background span is being closed off and not the font color span. I'm guessing that my simple "in font color span" and "in background span" bools won't cut it and you'll possibly have to output the span to include both foreground and background colors. Mind you, in this case, I think the whole enclosing div should be output with the background color rather than the actual text, as I have here. Another feature, another day.
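For what it's worth, one way the "single span with both colors" idea could be sketched is to remember the current foreground and background and, on every change, close the open span and reopen one carrying both styles. This is untested against real VS output and is only an assumption about how a fix might look; CombinedSpanTracker and its members are hypothetical names, and the colors are passed in as ready-made CSS strings, with null meaning "default".

```csharp
using System;
using System.Text;

public sealed class CombinedSpanTracker
{
    private string foreground;   // null means "default color"
    private string background;   // null means "default color"
    private bool spanOpen;

    // Each setter returns the HTML to emit at this point in the stream.
    public string SetForeground(string cssColor)
    {
        foreground = cssColor;
        return Reopen();
    }

    public string SetBackground(string cssColor)
    {
        background = cssColor;
        return Reopen();
    }

    // Call once at the end of the content to terminate a dangling span.
    public string Close()
    {
        if (!spanOpen)
            return String.Empty;
        spanOpen = false;
        return "</span>";
    }

    // Close the current span, then open a new one with whatever is non-default.
    private string Reopen()
    {
        StringBuilder html = new StringBuilder(Close());
        StringBuilder style = new StringBuilder();

        if (foreground != null)
            style.Append("color: " + foreground + ";");
        if (background != null)
        {
            if (style.Length > 0)
                style.Append(' ');
            style.Append("background-color: " + background + ";");
        }

        if (style.Length > 0)
        {
            html.Append("<span style=\"" + style + "\">");
            spanOpen = true;
        }
        return html.ToString();
    }
}
```

Because there is only ever one span open, the nesting problem goes away. The cost is that a \cb immediately followed by a \cf produces an empty span in the output, which is harmless but untidy.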
4 Responses
#1 Davy Landman said...
28-Nov-09 3:58 PM
I've struggled with the same problem, and actually I don't like the "dirty" html a RTF -> HTML transformation generates.
The ConTEXT editor offers a html export as well, but created css styles for the syntactical elements of C#. The layout of the code I could then tune using css. The downside is, no support for recognizing the Classes and Interfaces.
I would suspect using DxCore it wouldn't be hard to get a syntax tree of the code and use that to create a bunch of spans with just a class name on them?
#2 julian m bucknall said...
28-Nov-09 11:48 PM
Davy: It doesn't bother me that much, to be honest.
Certainly the code could be changed to produce HTML that made use of CSS rather than colored style attributes. However, the issue then is that instead of each code block in the page being separate, they would all have to follow the same CSS. In other words, you'd have to parse the code itself to identify comment, types, operators, strings, numbers, etc, etc and then colorize them. There are JavaScript parsers that do this at run time (Google have one for example), but I wanted above all to have static HTML, rather than client-side generated syntax highlighting.
Your idea of using DXCore would work too, although DXCore's parsers do go into an awful lot of detail in the AST.
Of course, you could also write or find parsers for all the languages you'd want to colorize. For me, that would be C#, some VB every now and then, JavaScript, XML, HTML, CSS, at the very minimum. Meh. I'll stick with using Microsoft devdiv's developers for that, and I'll just color their output. It's been working for nearly two years and I'm happy with it.
Cheers, Julian
#3 Davy Landma said...
29-Nov-09 4:55 AM
Actually the ConTEXT editor (written in Delphi) has syntax highlighting included for lots of languages, including C#. So this identifying could be skipped, but the problem is, Visual Studio also offers highlighting to Type names. Because ConTEXT doesn't use an AST to get the correct information, but just (I suspect) regular expression matching, this is not possible.
I haven't looked at DXCore's AST, and I suspect way too much detail indeed.
Personally, I also keep using the RTF->HTML conversion using the available plugins. But I thought, perhaps a suggestion for a nice free plugin to offer on the DevExpress site?
#4 Dew Drop – November 30, 2009 | Alvin Ashcraft's Morning Dew said...
30-Nov-09 5:56 AM
Pingback from Dew Drop – November 30, 2009 | Alvin Ashcraft's Morning Dew