PCPlus 258: Parsing comma-separated values

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year, since there's an Xmas issue as well, so it's a bit more than monthly). The column is called Theory Workshop and appears in the back of every issue. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after, say, a year or so. After all, the PDFs do appear on each issue's DVD after a couple of months.

Onto number four in my ongoing set of articles for PCPlus. This time I wanted to talk about state machines, but instead it turned into a discussion about Comma-Separated Values (CSV) files. Unfortunately the strap called them CSU files, but hey-ho.

The rationale for this is a series of prior blog posts and such where I'd investigated and expanded on what it would take to parse the CSV format. These prior investigations turned into a Java implementation when someone converted my C# library. Unlike before, when I started out from the language grammar, this time my angle of attack was the state machine diagram. It's always easier to draw a diagram than to work out the BNF for some grammar.

Unfortunately, this was yet another article where I tried to show code. I hadn't yet seen issue 257, where this had proved a failure, and so I carried on thinking that there was no problem. (I write articles about three months in advance, and Barnes & Noble get copies maybe three weeks after they appear in the UK, so it's almost four months from my sending off a zip to my reading the published end result.)

So, download the PDF and follow along with the code displayed here. First of all the actual state machine:

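  // A state consumes one character and returns the state to move to next.
  public interface IState {
    IState Process(char ch);
    // True if the input may legally end while the machine is in this state.
    bool IsTerminator { get; }
  }

  public static class CsvStateMachine {
    // Feeds each character of the text through the machine, starting from startState.
    public static void Execute(string text, IState startState) {
      IState currentState = startState;
      foreach (char c in text) {
        currentState = currentState.Process(c);
      }
      // Running out of input inside an unclosed quoted field is an error.
      if (!currentState.IsTerminator)
        throw new Exception("Done parsing, final field is not complete");
      // Flush the last field, which has no trailing comma.
      FieldProcessor.Finish();
    }
  }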

The FieldProcessor class:

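  // Accumulates the characters of the current field and prints each field once it is complete.
  public static class FieldProcessor {
    private static string field = String.Empty;
    public static void AddChar(char c) {
      field += c;
    }
    // Prints the field in square brackets and resets the buffer for the next field.
    public static void Finish() {
      Console.WriteLine('[' + field + ']');
      field = String.Empty;
    }
  }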

And finally the first state as a class:

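  // Start of a field: nothing has been read for it yet.
  public class FieldStartState : IState {
    public IState Process(char ch) {
      switch (ch) {
        case ',':
          // A comma straight away means an empty field.
          FieldProcessor.Finish();
          return this;
        case '"':
          // An opening quote starts a quoted field.
          return new ScanQuotedFieldState();
        case ' ':
          // Leading spaces are skipped.
          return this;
        default:
          // Anything else is the first character of an unquoted field.
          FieldProcessor.AddChar(ch);
          return new ScanFieldState();
      }
    }

    // The input may end here: the final field is simply empty.
    public bool IsTerminator {
      get { return true; }
    }
  }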

Bizarrely, I cannot now find the actual solution this code was taken from. What you see here was copied from my original Word doc, where it is nicely syntax-highlighted, so I must have had the solution in Visual Studio at some point.

The article first appeared in issue 258, August 2007.

You can download the PDF here.

UPDATE: (about half an hour later) I've recreated the code for the remaining states:

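  // Inside a quoted field: everything up to the next quote belongs to the field.
  internal class ScanQuotedFieldState : IState {
    public IState Process(char ch) {
      switch (ch) {
        case '"':
          // Either the closing quote or the first of a doubled quote;
          // TerminateFieldState decides which from the next character.
          return new TerminateFieldState();
        default:
          FieldProcessor.AddChar(ch);
          return this;
      }
    }

    // The input must not end inside an unclosed quoted field.
    public bool IsTerminator {
      get { return false; }
    }
  }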

  // Inside an unquoted field.
  internal class ScanFieldState : IState {
    public IState Process(char ch) {
      switch (ch) {
        case ' ':
          // A space ends an unquoted field; only spaces or a comma may follow.
          return new TerminateFieldState();
        case ',':
          FieldProcessor.Finish();
          return new FieldStartState();
        default:
          FieldProcessor.AddChar(ch);
          return this;
      }
    }

    public bool IsTerminator {
      get { return true; }
    }
  }

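  // The field has ended; waiting for the comma that separates it from the next one.
  internal class TerminateFieldState : IState {
    public IState Process(char ch) {
      switch (ch) {
        case ' ':
          // Trailing spaces are skipped.
          return this;
        case '"':
          // A second quote means the previous one was the first of a doubled pair:
          // keep one literal quote and resume scanning the quoted field.
          FieldProcessor.AddChar(ch);
          return new ScanQuotedFieldState();
        case ',':
          FieldProcessor.Finish();
          return new FieldStartState();
        default:
          throw new Exception("Invalid character after field was terminated.");
      }
    }

    public bool IsTerminator {
      get { return true; }
    }
  }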

This is the version where a quote inside a quoted field is, er, doubled: it is written as two quote characters in a row. If that makes sense...
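To make the doubled-quote handling easier to try out, here is a minimal sketch that condenses the four states into a single method. It is an illustration only, not the code from the article: the `State` enum, the `CsvSketch` class, and the `ParseLine` and `Emit` methods are names invented for this sketch, it returns the fields in a list rather than printing them, and it throws `FormatException` where the listings above throw `Exception`. The transitions themselves are meant to mirror the state classes one for one.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical names for illustration; not part of the article's code.
public enum State { FieldStart, ScanField, ScanQuoted, Terminate }

public static class CsvSketch {
  // Splits one line of text into fields, mirroring the transitions of the
  // four state classes but collecting the fields instead of printing them.
  public static List<string> ParseLine(string text) {
    var fields = new List<string>();
    var field = new StringBuilder();
    State state = State.FieldStart;

    foreach (char ch in text) {
      switch (state) {
        case State.FieldStart:
          if (ch == ',') Emit(fields, field);            // empty field
          else if (ch == '"') state = State.ScanQuoted;  // opening quote
          else if (ch != ' ') { field.Append(ch); state = State.ScanField; }
          break;
        case State.ScanField:
          if (ch == ' ') state = State.Terminate;        // space ends an unquoted field
          else if (ch == ',') { Emit(fields, field); state = State.FieldStart; }
          else field.Append(ch);
          break;
        case State.ScanQuoted:
          // A quote here is either the closing quote or the first of a pair.
          if (ch == '"') state = State.Terminate;
          else field.Append(ch);
          break;
        case State.Terminate:
          // A second quote completes a doubled pair: keep one literal quote.
          if (ch == '"') { field.Append(ch); state = State.ScanQuoted; }
          else if (ch == ',') { Emit(fields, field); state = State.FieldStart; }
          else if (ch != ' ')
            throw new FormatException("Invalid character after field was terminated.");
          break;
      }
    }
    // Only an unclosed quoted field is an illegal place to run out of input.
    if (state == State.ScanQuoted)
      throw new FormatException("Done parsing, final field is not complete");
    Emit(fields, field);
    return fields;
  }

  // Stores the completed field and resets the buffer for the next one.
  private static void Emit(List<string> fields, StringBuilder field) {
    fields.Add(field.ToString());
    field.Length = 0;
  }
}
```

Under these assumptions, `CsvSketch.ParseLine("\"say \"\"hi\"\"\",x")`, that is, the input `"say ""hi""",x`, should return two fields: `say "hi"` and `x`. Each doubled quote passes through the terminate state and straight back into the quoted field, leaving a single literal quote behind.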

Finally here's the method that kicks off the parser:

  public static class CsvStateMachine {
    //...
    public static void Parse(string text) {
      Execute(text, new FieldStartState());
    }
  }

 

Now playing:
Dire Straits - Why Worry
(from Brothers in Arms)

 


2 Responses

#1 Schalk Versteeg said...
04-Sep-14 3:34 AM

Hi Julian,

I tried downloading the article PDF but got the following error:

```

AccessDenied

Access Denied

6634AAE20E0494EF

EDP+1J27I0Zvbjx3CTAGvYG/tlWkNyOlKcTGVSu4DqfY0oaK2/znntRbZMj36gyO

```

Is the PDF for this article still available?

#2 julian m bucknall said...
08-Sep-14 4:01 PM

Schalk: Weird one. I have two links to the PDF in this post, and the first one was an invalid URL. Now fixed. Sorry about that.

Cheers, Julian
