Now with fewer errors!!!!

Recursive Descent Parsing in C/C++

A very brief introduction on how to do it.

§0. Goal for this Document

In a class that I am currently TA'ing the students were assigned to write a program that would read in a formula of propositional logic and then print out its truth table. Few students had difficulty with the “logic” portion of the assignment, but many seemed to struggle with the parsing.

The language that the propositional logic formula was to be provided in was intentionally made to be rather simple... simple enough that a recursive descent parser (RDP from here on) could be used for parsing. Unfortunately, many of the students were either unfamiliar with RDP or had a tough time figuring out from books or websites how to implement a RDP. This document aims to help with that.

Of course, I have only finished this document after the assignment was due... so no help to the students... Hahahahah! (sorry)

Important Note: Recursive Descent Parsing is really rather simple, even when done in a language like C/C++ (it's even easier in functional languages like Haskell). After someone sees how to do it once, they can usually implement a RDP as quickly as they can type.

For parsing in general, there are many tools out there that will actually generate a parser for you, in whatever language you like (e.g. Lex & Yacc for C/C++/Java, Alex/Happy/Parsec for Haskell, etc.). More on these later. In this case, unless you were already familiar with any of these other tools, writing a simple RDP from scratch was probably your easiest bet.

Also, it's just-plain-good-to-know for any CS student!

§1. Some Theory...With Gratuitous Hand-Waving

Looking at the language of propositional logic, we see we have a few “basic” formula like constants (T and F denoting true and false, respectively), propositions (any string containing lowercase-letters and numbers), and then various connectives that can be used to join together other valid formula

¬(φ)        (φ & ψ)        (φ v ψ)        (φ → ψ)        (φ ↔ ψ)

The idea behind the language, and what constitutes a well formed formula (wff) is simple, but to state it precisely, we'll use a Backus-Naur Form (BNF), which can be used to represent any context free grammar. If you have no idea what these are, don't worry... you'll learn about them later in your Automata Theory course. For now, just know that constructing this BNF (the syntax of which is very simple) aids greatly in implementing the RDP in C/C++.

In fact, it turns out if if you can represent a given language as a BNF (with certain properties imposed on the BNF that I'm waving my hands over), then you can always parse (or “recognize”) that language using a recursive descent parser.

§1.1. The BNF for Our Language

First, let's take a look at the BNF and then each line will be explained. Note that the logical connectives are expressed as their LaTeX command equivalents, which is what the assignment stipulated. Additionally, the operators are in prefix (Polish) notation.

          Formula ::= Constant | Proposition | UnaryFormula | BinaryFormula
         Constant ::= "T" | "F"
      Proposition ::= [a-z0-9]+
     UnaryFormula ::= LeftParen UnaryOperator Formula RightParen
    BinaryFormula ::= LeftParen BinaryOperator Formula Formula RightParen
        LeftParen ::= "("
       RightParen ::= ")"
    UnaryOperator ::= "\neg"
   BinaryOperator ::= "\vee" | "\wedge" | "\rightarrow" | "\leftrightarrow"

§1.2. Explaining the BNF

The BNF is composed of 9 rules, with each rule on a single line. Each rule is of the form:

      Expression ::= definition
and tells you how Expression is to be constructed. As we see, the rules come in a few varieties, each explained next.

Choice. The vertical bar “|” denotes “or”, and says that the pattern may be either of the sub-patterns listed. So, the rule:

          Formula ::= Constant | Proposition | UnaryFormula | BinaryFormula
tells us a Formula is always exactly one of: Constant, Proposition, UnaryFormula, or BinaryFormula. It's one of those patterns, whatever those patterns are.

Sequence. Spaces between sub-patterns in the definition denote that the provided expression is built from the explicit sequence of patterns. So the rule:

     UnaryFormula ::= LeftParen UnaryOperator Formula RightParen
tells us a UnaryFormula is a sequence of a LeftParen, followed by a UnaryOperator, followed by a Formula, followed by a RightParen.

Literal Text. Red text denotes literal strings. So the rules:

        LeftParen ::= "("
         Constant ::= "T" | "F"
tell us that a LeftParen is exactly the string "(", and a Constant always either the string "T" or the string "F".

Text Pattern. The blue pattern used to define Proposition,

      Proposition ::= [a-z0-9]+
tells us that a proposition is any sequence of one-or-more lower-case letters and numbers. Note that is really just shorthand for more BNF rules such as the following:
      Proposition ::= LetterOrNumber String
           String ::= LetterOrNumber String | 0
   LetterOrNumber ::= "a" | "b" | "c" | ...  | "z" | "0" | "1" |  ...  | "9"
and we could, if we were inclined, expand our BNF into more rules. Because, however, this type of string (only lower-case letters and numbers) does NOT overlap with any other rule (constants are capital letters, parenthesis aren't letters, and operators start with a slash), we'll keep this rule as a simple pattern and use sscanf to handle the matching for us.

Putting it All Together. Note the following key ideas from above:

  1. Rules are defined in terms of each other! They are recursive!
    That is, Formula is defined in terms of (among other things) UnaryFormula, and UnaryFormula is defined in terms of Formula.
  2. As we'll see in our implementation, we'll construct a set of mutually-recursive functions that mirror this BNF very closely.
  3. Also, note that the only rules that actually define any literal text are the LeftParen, RightParen, Constant, Proposition, and Operator rules. As we'll see in the implementation, these are also the only functions that do any actual reading of our input string.

Note: Make sure you understand the above. Once you have an understanding of the BNF, you can use it to implement a RDP for our propositional logic language very very quickly, as we'll see in the next section.

§2. Using our BNF to implement a RDP in C

To implement an RDP in C, we're going to use the following approach. Store all of the formula in a single string, of type char[] (i.e. a C-style string). If the formula spans multiple lines, read them into a single string.

Next, we're going to implement a series of functions, one for each rule in our BNF (defining an expression). Each function will behave identically, in the following manner. Given a rule:

   Expression ::= definition

We'll define a function

   bool Expression(char *s[]);

That takes in the input string s and returns a bool. IF the input matches the expression's definition, THEN true will be returned and the input string pointer s will be advanced past the end of the expression in the string. OTHERWISE (if the input string does not match the expression), false will be returned and the input string pointer s will remain unchanged.

Our functions will be eerily similar to the BNF rules they correspond to, and also surprisingly simple. Additionally, they will all follow a similar pattern.

§2.1. Two Helper Functions

Before we get to the implementation, we define two helper functions that we'll use extensively.

The first is SkipWhitespace, which takes in the input string pointer s and simply advances it past any whitespace (spaces, tabs, and newlines). Using this, we can allow arbitrary spacing in our expressions. In most RDP literature, this corresponds to the NextToken function... I just think this name is a bit more descriptive for our case.

   void SkipWhitespace(char *s[]) {
      while (*s[0] == ' ' || *s[0] == '\n' || *s[0] == '\t')
         *s = *s + 1;
   }

The next helper function is match, which takes in the input string pointer s and a fixed, known string named token. It compares the first strlen(token) characters of s to see if it matches token. If so, the pointer s is advanced past the token and true is returned. Otherwise, s remains unchanged and false is returned.

   bool match(char *s[], const char *token) {
      if (strncmp(*s, token, strlen(token)) == 0) {
         *s = *s + strlen(token);
         return true;
      }

      return false;
   }

Note: Recall your simple pointer arithmetic! Also recall that in C and C++, an array name is simply a pointer to the first element of the array. That's the pointer we're advancing through the string. Don't worry, it's never very complicated.

Also recall the standard C string functions strlen, strncmp, and sscanf. This is the only place strlen and strncmp are used, and sscanf will only be used once. None the less, consult your C++ text books on these functions first, then ask Dr. Google if you still have questions.

§2.2. Our Parsing Functions: Overview

Recall our BNF. It has 9 rules and defines 9 expression types (Formula, Constant, etc.). We'll use this BNF and write the following 9 functions, one for each expression type:

   bool Formula(char *s[]);
   bool Constant(char *s[]);
   bool Proposition(char *s[]);
   bool UnaryFormula(char *s[]);
   bool BinaryFormula(char *s[]);
   bool LeftParen(char *s[]);
   bool RightParen(char *s[]);
   bool UnaryOperator(char *s[]);
   bool BinaryOperator(char *s[]);

Recall from the start of this section how each of these functions is going to behave.

   bool Expression(char *s[]);
is going to read string s and IF it contains Expression THEN it will advance s past the expression and return true. OTHERWISE it leaves s unchanged and returns false.

The Formula function will be our “top-level” function that we'll use for testing if a given string contains a wff of propositional logic. (As we'll see later, we'll actually need one more on top of this...but I don't want to get into that right now).

The definition of each function is very closely based upon its rule in our BNF. Let's get onto our function definitions.

§2.2.1. Implementation: Formula

Recall that our rule for Formula in our BNF was a Choice rule. A Formula was either a constant, a proposition, a unary-formula, or a binary-formula. Our function does nothing more than check that:

   bool Formula(char *s[]) {

      char *original = *s;
      SkipWhitespace(s);

      if (Constant(s) || Proposition(s) || UnaryFormula(s) || BinaryFormula(s))
         return true;

      *s = original;
      return false;
   }

Kinda anti-climatic, isn't it? Let's look at what it does in detail. First, it creates a copy of s (a pointer) stored in original. Then it skips past any white space, so that s is now pointing to something. Then it asks if that something is a constant, proposition, unary-formula, or binary-formula. If so, it returns true. Otherwise, we reset s (remember that SkipWhitespace changed s) and return false.

§2.2.2. Implementation: UnaryFormula & BinaryFormula

Now to our functions for UnaryFormula and BinaryFormula. These rules in our BNF represent a Sequence of patterns, and we must check each pattern in the same order they appear in the BNF. If any of the pattern checks fail, we reset s to whatever it was originally, and return false.

   /* identify a unary formula (see BNF) */
   bool UnaryFormula(char *s[]) {

      char *original = *s;
      SkipWhitespace(s);
   
      if (!LeftParen(s)) {
         *s = original;
         return false;
      }
   
      if (!UnaryOperator(s)) {
         *s = original;
         return false;
      }
   
      if (!Formula(s)) {
         *s = original;
         return false;
      }
   
      if (!RightParen(s)) {
         *s = original;
         return false;
      }
   
      return true;
   }


   /* identify a binary formula (see BNF) */
   bool BinaryFormula(char *s[]) {

      char *original = *s;
      SkipWhitespace(s);
   
      if (!LeftParen(s)) {
         *s = original;
         return false;
      }
   
      if (!BinaryOperator(s)) {
         *s = original;
         return false;
      }
   
      if (!Formula(s)) {
         *s = original;
         return false;
      }
   
      if (!Formula(s)) {
         *s = original;
         return false;
      }
   
      if (!RightParen(s)) {
         *s = original;
         return false;
      }
   
      return true;
   }

Again, rather anti-climatic still... where is all the parsing going on? As you'll see, each function is very, very simple. All the real work is done in SkipWhitespace and Match. Our parsing functions for each expression just add the “logic” from the BNF.

Study the above and make sure you understand the logic. Each function is checking the sequence of patterns specified in its BNF, in the same order they appear in the BNF. If any of them fail, they reset s and return false. If all of them succeed, we return true.

§2.2.3. Implementation: LeftParen, RightParen, & Constant

Recall that these three expressions in the BNF all contained Literal Text, and the Constant rule also contained a Choice operator. This will be reflected in their implementation.

/* Identify a left-parenthesis. Note this is a LITERAL TEXT rule. */
bool LeftParen(char *s[]) {

   char *original = *s;
   SkipWhitespace(s);

   if (match(s, "("))
      return true;

   *s = original;
   return false;
}

/* Identify a right-parenthesis. Note this is a LITERAL TEXT rule. */
bool RightParen(char *s[]) {

   char *original = *s;
   SkipWhitespace(s);

   if (match(s, ")"))
      return true;

   *s = original;
   return false;
}

/* Identify a constant (either 'T' or 'F', that's it).
 * Note this is a CHOICE rule between two LITERAL TEXT items.*/
bool Constant(char *s[]) {

   char *original = *s;
   SkipWhitespace(s);

   if (match(s, "T") || match(s, "F"))
      return true;

   *s = original;
   return false;
}

Each of these functions makes use of the match helper function because each of these rules contains Literal Text in their BNF. As we'll see, the only functions that ever do any actual string matching are the ones corresponding to rules with literal text in the BNF.

§2.2.4. Implementation: Proposition

Now let's look at the function for Proposition...

bool Proposition(char *s[]) {

   char *original = *s;
   SkipWhitespace(s);

   /* these are used in sscanf to record what was read from string */
   char prop[MAX_PROP_LENGTH];
   int  numCharsRead;

   if (sscanf(*s, "%[a-z0-9]%n", &prop, &numCharsRead) == 1) {
      *s = *s + numCharsRead;
      return true;
   }

   *s = original;
   return false;
}

Here we're skipping all whitespace (as usual) and then using the standard C-library string function sscanf to do our simple pattern match against s, and extract that into a string called prop. We also use the %n feature of sscanf to tell us how many characters long the pattern it extracts is, if any, and use that to advance the string pointer s.

Note that MAX_PROP_LENGTH is a global constant I have set to 100.

This is about the most complicated function we define.

§2.2.5. Implementation: UnaryOperator & BinaryOperator

Now for the UnaryOperator and BinaryOperator functions. To make my life easy, I have the following global constants defined:

   /* some string constants (so i don't have to type these out ever again, and
    * they can be easily changed) */
   const char *STRING_NOT     = "\\neg";
   const char *STRING_OR      = "\\vee";
   const char *STRING_AND     = "\\wedge";
   const char *STRING_IMPLIES = "\\rightarrow";
   const char *STRING_IFF     = "\\leftrightarrow";

Now for the functions. As stated above, since these rules contain literal text in the BNF, we'll be using our match function.

/* Identify a unary operator (currently only '\neg').
 * Note this is a LITERAL TEXT rule. */
bool UnaryOperator(char *s[]) {

   char *original = *s;
   SkipWhitespace(s);

   if (match(s, STRING_NOT))
      return true;

   *s = original;
   return false;
}

/* Identify a binary operator (either '\vee', '\wedge', etc.)
 * Note this is a CHOICE rule among LITERAL TEXT options. */
bool BinaryOperator(char *s[]) {

   char *original = *s;
   SkipWhitespace(s);

   if (match(s, STRING_OR) || match(s, STRING_AND)
   ||  match(s, STRING_IMPLIES) || match(s, STRING_IFF))
      return true;

   *s = original;
   return false;
}

And that's it. Seriously. Well, almost... but what's left is really easy!

§2.3. Putting a Wrapper around Formula

We're almost done. Our function Formula will correctly identify if a wff occurs in the string provided, but there's one problem...

Suppose our string contains a wff followed by some extra stuff. For example, consider the string:

s="(\leftrightarrow (\neg p) (\vee q T))       :("

Our Formula function will recognize the first part of the string that is a wff and advance s to the first space after it, and return true. Remember the semantics of our functions.

This is obviously not correct since s contains, after a wff, something that makes it not a wff. There are a few ways to fix it. We could update our BNF, but after writing all of this, I really don't feel like doing that. Instead, we'll write a “wrapper” function around Formula that we'll use as our top-level wff-checking function, and add the logic to make sure that, if Formula returns true, the rest of the string is empty.

   bool FormulaWrapper(char *s[]) {

      char *original = *s;

      if (Formula(s)) {
         SkipWhitespace(s);
         if (*s[0] == '\0')
            return true;
         else {
            /* there was "junk" after the formula! */
            *s = original;
            return false;
         }
      }
      return false;
   }

§3. Full Implementation

You can view an HTML-ized version of all of the code above here: parser.c.html. It includes a simple driver program that, when run, will repeatedly prompt the user to enter a propositional logic formula (on a single line) up to a 1000 characters long, and each time the program will tell you if what was entered is a wff or not.

Sample output from my program (user input in red):

[email protected]> ./parser
Enter formula of proposition logic, one per line (max 1000 characters)
Each time I'll tell you if it's a wff.  Hit CTRL+D to exit.
T
===> it's a wff :)
p
===> it's a wff :)
(\neg p)
===> it's a wff :)
(\neg p p)
===> NOT A WFF :(
(\vee T T)
===> it's a wff :)
(\leftrightarrow (\vee p (\neg q)) (\wedge (\leftrightarrow r s) T))
===> it's a wff :)
(\leftrightarrow (\vee p (\neg q)) (\wedge (\leftrightarrow r s) T))      stuff
===> NOT A WFF :(

§4. Comments on RDP and other Parsers

As you can see, Recursive Descent Parsing is really quite easy to implement and that is its key strength. Additionally, when you have to extract information during the parsing (as you do with this assignment, more on that in the next section), it's very easy to modify an RDP parser to do this. Here's a short list of the pros/cons of RDP parsers that comes to mind right now:

PROS of Recursive Descent Parsing:
CONS of Recursive Descent Parsing:

If you're in a situation where RDP isn't appropriate, there are many other options available for you. For example, you could use a parser generator to build a parser for you. This is a program (or series of programs) that you provide something like a BNF to, and it actually spits out a chunk of code in C/C++/Java/etc. that does much of the parsing for you.

One of the more popular parser generators available are Lex and Yacc (*nix implementations are called Flex and Bison). The only downside to using such tools is the complexity. They are industrial strength parser generators, with many, many bells and whistles. Learning and using them effectively can take quite a long time.

For functional languages such as Haskell, there are many options. Most of them are very simple to use and, if you're already good with using languages like Haskell, you can hit-the-ground-running with them.

§5. Using Infix Notation

Suppose you wanted to allow binary operators to use infix notation rather than prefix (Polish) notation. How would go about modifying the parser built above to handle this?

Hint: it's exceptionally easy!

Notice we only have to update one rule in our BNF (the rule for BinaryFormula). You then only have to modify the corresponding function BinaryFormula in a rather trivial way. Can you see how?

§6. This is all Great... but how do I Extract Information While Parsing?

The above only explains how to recognize if a given string contains a wff of propositional logic, in the format prescribed for the assignment. Your task was to read in such a formula and then generate its truth table. Now, recognizing that a given piece of text (from an input file) represents a wff is obviously important, but how do you use this to build some structure in C/C++ (or whatever language) that you can use/manipulate like a propositional formula in order to build the truth table?

With a little thought this is quite easy. First, figure out how you're going to store this information internally. Notice that in representing any logical formula, you should be able to represent

Once you have a structure capable of representing all of the above, you can then modify some of the parsing functions to include an additional parameter... say, a pointer to your structure. For example,

   bool Formula(char *s[], LogicalFormula *f);
   bool Constant(char *s[], LogicalFormula *f);
   bool Proposition(char *s[], LogicalFormula *f);
   bool UnaryFormula(char *s[], LogicalFormula *f);
   bool BinaryFormula(char *s[], LogicalFormula *f);

In the case that any of these parse functions return true, and you know you have a wff of the appropriate type, you then set f to point to a new object of the appropriate structure.

I'm not going to go into any more detail than this because at this point, this is something you should know how to do. None the less, if any of you would like to see how I did it, stop by my office hours and I'll be happy to show you (I will NOT, however, post the code online or give you a copy of my code in any form).


October 2009 by Ryan Flannery <[email protected]>