I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year, since there's an Xmas issue as well, so it's a bit more than monthly). The column is called Theory Workshop and appears in the back of every issue. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after, say, a year or so. After all, the PDFs do appear on each issue's DVD after a couple of months.
On to number four in my ongoing set of articles for PCPlus. This time I wanted to talk about state machines, but instead it turned into a discussion about Comma-Separated Values (CSV) files. Unfortunately the strap called them CSU files, but hey-ho.
The rationale for this is a series of prior blog posts and such where I'd investigated and expanded on what it would take to parse the CSV format. These prior investigations turned into a Java implementation when someone converted my C# library. Unlike before, when I started out from the language grammar, this time my angle of attack was the state machine diagram. It's always easier to draw a diagram than to work out the BNF for some grammar.
Unfortunately, this was yet another attempt where I tried to show code in the article. I hadn't yet seen issue 257 where this had proved a failure (I write articles about three months in advance, and Barnes & Noble get copies maybe three weeks after they appear in the UK, so it's almost four months from sending off a zip to me reading the published end-result), and so I continued thinking that there was no problem.
So, download the PDF and follow along with the code displayed here. First of all the actual state machine:
    using System;

    // A state consumes one character and returns the next state.
    public interface IState
    {
        IState Process(char ch);
        bool IsTerminator { get; }   // true if the input may legally end here
    }

    public static class CsvStateMachine
    {
        public static void Execute(string text, IState startState)
        {
            IState currentState = startState;
            foreach (char c in text)
            {
                currentState = currentState.Process(c);
            }
            if (!currentState.IsTerminator)
                throw new Exception("Done parsing, final field is not complete");
            FieldProcessor.Finish();   // emit the last field
        }
    }
The FieldProcessor class:
    public static class FieldProcessor
    {
        private static string field = String.Empty;

        public static void AddChar(char c)
        {
            field += c;
        }

        public static void Finish()
        {
            Console.WriteLine('[' + field + ']');
            field = String.Empty;
        }
    }
And finally the first state as a class:
    public class FieldStartState : IState
    {
        public IState Process(char ch)
        {
            switch (ch)
            {
                case ',':   // empty field
                    FieldProcessor.Finish();
                    return this;
                case '"':   // start of a quoted field
                    return new ScanQuotedFieldState();
                case ' ':   // skip leading spaces
                    return this;
                default:    // first character of an unquoted field
                    FieldProcessor.AddChar(ch);
                    return new ScanFieldState();
            }
        }

        public bool IsTerminator
        {
            get { return true; }
        }
    }
Bizarrely, I cannot now find the actual solution from which this code is gathered. What you see here was copied from my original Word doc, where it is nicely syntax-highlighted, so I must have had the solution in Visual Studio at some point.
The article first appeared in issue 258, August 2007.
You can download the PDF here.
UPDATE: (about half an hour later) I've recreated the code:
    internal class ScanQuotedFieldState : IState
    {
        public IState Process(char ch)
        {
            switch (ch)
            {
                case '"':   // closing quote, or the first of a doubled quote
                    return new TerminateFieldState();
                default:
                    FieldProcessor.AddChar(ch);
                    return this;
            }
        }

        // Input must not end inside an open quoted field.
        public bool IsTerminator
        {
            get { return false; }
        }
    }

    internal class ScanFieldState : IState
    {
        public IState Process(char ch)
        {
            switch (ch)
            {
                case ' ':
                    return new TerminateFieldState();
                case ',':
                    FieldProcessor.Finish();
                    return new FieldStartState();
                default:
                    FieldProcessor.AddChar(ch);
                    return this;
            }
        }

        public bool IsTerminator
        {
            get { return true; }
        }
    }

    internal class TerminateFieldState : IState
    {
        public IState Process(char ch)
        {
            switch (ch)
            {
                case ' ':   // skip trailing spaces
                    return this;
                case '"':   // second quote of a doubled pair: keep one and resume
                    FieldProcessor.AddChar(ch);
                    return new ScanQuotedFieldState();
                case ',':
                    FieldProcessor.Finish();
                    return new FieldStartState();
                default:
                    throw new Exception("Invalid character after field was terminated.");
            }
        }

        public bool IsTerminator
        {
            get { return true; }
        }
    }
This shows the version where quotes inside quoted fields are, er, double quoted. If that makes sense...
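To make the doubled-quote handling a little more concrete, here is a minimal sketch that folds the four states into one enum and a switch, and collects the fields in a list rather than writing them to the console. The names `CsvSketch`, `ParseLine` and `State` are hypothetical and not part of the original solution; the transitions are only intended to mirror the classes shown in this post, not to replace them.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    // Hypothetical enum standing in for the four state classes in the post.
    private enum State { FieldStart, ScanField, ScanQuotedField, TerminateField }

    // Parses a single line of text and returns its fields.
    public static List<string> ParseLine(string text)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        State state = State.FieldStart;

        foreach (char ch in text)
        {
            switch (state)
            {
                case State.FieldStart:
                    if (ch == ',') { fields.Add(field.ToString()); field.Clear(); }
                    else if (ch == '"') state = State.ScanQuotedField;
                    else if (ch != ' ') { field.Append(ch); state = State.ScanField; }
                    break;

                case State.ScanField:
                    if (ch == ' ') state = State.TerminateField;
                    else if (ch == ',')
                    {
                        fields.Add(field.ToString());
                        field.Clear();
                        state = State.FieldStart;
                    }
                    else field.Append(ch);
                    break;

                case State.ScanQuotedField:
                    // A quote either closes the field or starts a doubled pair;
                    // TerminateField decides which once it sees the next character.
                    if (ch == '"') state = State.TerminateField;
                    else field.Append(ch);
                    break;

                case State.TerminateField:
                    if (ch == '"')
                    {
                        // Second quote of a doubled pair: keep one quote, resume scanning.
                        field.Append(ch);
                        state = State.ScanQuotedField;
                    }
                    else if (ch == ',')
                    {
                        fields.Add(field.ToString());
                        field.Clear();
                        state = State.FieldStart;
                    }
                    else if (ch != ' ')
                    {
                        throw new FormatException("Invalid character after field was terminated.");
                    }
                    break;
            }
        }

        // Only an unclosed quoted field is an illegal place for the input to end.
        if (state == State.ScanQuotedField)
            throw new FormatException("Done parsing, final field is not complete");

        fields.Add(field.ToString());
        return fields;
    }
}
```

Feeding it the text `"say ""hi""",x` should give two fields, `say "hi"` and `x`: each doubled quote passes briefly through the terminate state, which appends a single quote and hands control back to the quoted-field state.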
Finally here's the method that kicks off the parser:
    public static class CsvStateMachine
    {
        //...

        public static void Parse(string text)
        {
            Execute(text, new FieldStartState());
        }
    }
Now playing:
Dire Straits - Why Worry
(from Brothers in Arms)
2 Responses
#1 Schalk Versteeg said...
04-Sep-14 3:34 AM
Hi Julian,
I tried downloading the article PDF but obtained the following error:
```
AccessDenied
EDP+1J27I0Zvbjx3CTAGvYG/tlWkNyOlKcTGVSu4DqfY0oaK2/znntRbZMj36gyO
```
Is the pdf for this article still available?
#2 julian m bucknall said...
08-Sep-14 4:01 PM
Schalk: Weird one. I have two links to the PDF in this post, and the first one was an invalid URL. Now fixed. Sorry about that.
Cheers, Julian