In a class that I am currently TA'ing the students were assigned to write a program that would read in a formula of propositional logic and then print out its truth table. Few students had difficulty with the “logic” portion of the assignment, but many seemed to struggle with the parsing.

The language that the propositional logic formula was to be provided in was
intentionally made to be rather simple... simple enough that a ** recursive
descent parser** (

Of course, I have only finished this document *after* the
assignment was due... so no help to the students... Hahahahah!
(sorry)

**Important Note:** Recursive Descent Parsing is really rather simple,
even when done in a language like C/C++ (it's even easier in functional
languages like Haskell). After someone sees how to do it *once*, they
can usually implement a RDP as quickly as they can type.

For parsing in
general, there are many tools out there that will actually generate a
parser for you, in whatever language you like (e.g. Lex & Yacc for
C/C++/Java, Alex/Happy/Parsec for Haskell, etc.). More on these later.
In this case, unless you were already familiar with any of these other
tools, writing a simple RDP from scratch was probably your easiest bet.

*Also, it's just-plain-good-to-know for any CS student!*

Looking at the language of propositional logic, we see we have a few
“basic” formula like constants (` T` and

The idea behind the language, and what constitutes a ** well formed
formula** (wff) is simple, but to state it precisely, we'll
use a

In fact, it turns out if *if* you can represent a given language
as a BNF (with certain properties imposed on the BNF that I'm waving my
hands over), *then*
you can always parse (or “recognize”) that language using
a recursive descent parser.

First, let's take a look at the BNF and then each line will be explained. Note that the logical connectives are expressed as their LaTeX command equivalents, which is what the assignment stipulated. Additionally, the operators are in prefix (Polish) notation.

Formula::=Constant|Proposition|UnaryFormula|BinaryFormulaConstant::= "T" | "F"Proposition::= [a-z0-9]+UnaryFormula::=LeftParenUnaryOperatorFormulaRightParenBinaryFormula::=LeftParenBinaryOperatorFormulaFormulaRightParenLeftParen::= "("RightParen::= ")"UnaryOperator::= "\neg"BinaryOperator::= "\vee" | "\wedge" | "\rightarrow" | "\leftrightarrow"

The BNF is composed of 9 *rules*, with each rule on a single line.
Each rule is of the form:

and tells you howExpression::=definition

**Choice.** The vertical bar
“|” denotes “or”, and says that the pattern may
be either of the sub-patterns listed. So, the rule:

tells us aFormula::=Constant|Proposition|UnaryFormula|BinaryFormula

**Sequence.** Spaces between sub-patterns in the definition denote that the
provided expression is built from the explicit sequence of patterns. So the
rule:

tells us aUnaryFormula::=LeftParenUnaryOperatorFormulaRightParen

**Literal Text.** Red text denotes literal strings. So the rules:

tell us that aLeftParen::= "("Constant::= "T" | "F"

**Text Pattern.** The blue pattern used to define *Proposition*,

tells us that a proposition is any sequence of one-or-more lower-case letters and numbers. Note that is really just shorthand for more BNF rules such as the following:Proposition::= [a-z0-9]+

and we could, if we were inclined, expand our BNF into more rules. Because, however, this type of string (only lower-case letters and numbers) does NOT overlap with any other rule (constants are capital letters, parenthesis aren't letters, and operators start with a slash), we'll keep this rule as a simple pattern and useProposition::=LetterOrNumberStringString::=LetterOrNumberString| 0LetterOrNumber::= "a" | "b" | "c" | ... | "z" | "0" | "1" | ... | "9"

**Putting it All Together.** Note the following key ideas from above:

- Rules are defined in terms of each other! They are
*recursive!*

That is,*Formula*is defined in terms of (among other things)*UnaryFormula*, and*UnaryFormula*is defined in terms of*Formula*. - As we'll see in our implementation, we'll construct a set of mutually-recursive functions that mirror this BNF very closely.
- Also, note that the only rules that actually define any literal text
are the LeftParen, RightParen, Constant, Proposition, and Operator
rules. As we'll see in the implementation, these are also the only
functions that do any
*actual*reading of our input string.

**Note:** Make sure you understand the above. Once you have an understanding of the
BNF, you can use it to implement a RDP for our propositional logic language very
*very quickly*, as we'll see in the next section.

To implement an RDP in C, we're going to use the following approach. Store
all of the formula in a single string, of type ** char[]** (i.e.
a C-style string). If the formula spans multiple lines, read them into a
single string.

Next, we're going to implement a series of functions, one for each rule in
our BNF (defining an expression). Each function will behave
*identically*, in the following manner. Given a rule:

Expression::=definition

We'll define a function

bool Expression(char *s[]);

That takes in the input string **s** and returns a bool. *IF* the input
matches the expression's definition, *THEN* true will be returned and the
input string pointer **s** will be advanced past the end of the expression in
the string. *OTHERWISE* (if the input string does not match the expression),
false will be returned and the input string pointer **s** will remain unchanged.

Our functions will be eerily similar to the BNF rules they correspond to, and also surprisingly simple. Additionally, they will all follow a similar pattern.

Before we get to the implementation, we define two helper functions that we'll use extensively.

The first is **SkipWhitespace**, which takes in the input string pointer
**s** and simply advances it past any whitespace (spaces, tabs, and newlines).
Using this, we can allow arbitrary spacing in our expressions. In most
RDP literature, this corresponds to the **NextToken** function... I just
think this name is a bit more descriptive for our case.

void SkipWhitespace(char *s[]) { while (*s[0] == ' ' || *s[0] == '\n' || *s[0] == '\t') *s = *s + 1; }

The next helper function is **match**, which takes in the input string
pointer **s** and a fixed, known string named **token**. It compares the
first ** strlen(token)** characters of

bool match(char *s[], const char *token) { if (strncmp(*s, token, strlen(token)) == 0) { *s = *s + strlen(token); return true; } return false; }

**Note:** Recall your simple pointer arithmetic! Also recall that in C and C++, an
array name is simply a pointer to the first element of the array. That's the pointer we're
advancing through the string. Don't worry, it's never very complicated.

Also recall the standard C string functions ** strlen**,

Recall our BNF. It has 9 rules and defines 9 expression types (Formula, Constant, etc.). We'll use this BNF and write the following 9 functions, one for each expression type:

bool Formula(char *s[]); bool Constant(char *s[]); bool Proposition(char *s[]); bool UnaryFormula(char *s[]); bool BinaryFormula(char *s[]); bool LeftParen(char *s[]); bool RightParen(char *s[]); bool UnaryOperator(char *s[]); bool BinaryOperator(char *s[]);

Recall from the start of this section how each of these functions is going to behave.

boolis going to read stringExpression(char *s[]);

The **Formula** function will be our “top-level” function that we'll use for
testing if a given string contains a wff of propositional logic. (As we'll see later,
we'll actually need one more on top of this...but I don't want to get into that right now).

The definition of each function is *very closely* based upon its rule in our BNF.
Let's get onto our function definitions.

Recall that our rule for *Formula* in our BNF was a **Choice** rule.
A *Formula* was either a constant,
a proposition, a unary-formula, or a binary-formula. Our function does nothing more
than check that:

bool Formula(char *s[]) { char *original = *s; SkipWhitespace(s); if (Constant(s) || Proposition(s) || UnaryFormula(s) || BinaryFormula(s)) return true; *s = original; return false; }

Kinda anti-climatic, isn't it? Let's look at what it does in detail. First, it creates a
copy of **s** (a pointer) stored in **original**. Then it skips past any white space, so that
**s** is now pointing to *something*. Then it asks if that something is a
constant, proposition, unary-formula, or binary-formula. If so, it returns true.
Otherwise, we reset **s** (remember that **SkipWhitespace** changed
**s**) and return false.

Now to our functions for UnaryFormula and BinaryFormula. These rules in our BNF
represent a **Sequence** of patterns, and we must check each pattern in the
same order they appear in the BNF. If *any* of the pattern checks fail, we
reset **s** to whatever it was originally, and return false.

/* identify a unary formula (see BNF) */ bool UnaryFormula(char *s[]) { char *original = *s; SkipWhitespace(s); if (!LeftParen(s)) { *s = original; return false; } if (!UnaryOperator(s)) { *s = original; return false; } if (!Formula(s)) { *s = original; return false; } if (!RightParen(s)) { *s = original; return false; } return true; } /* identify a binary formula (see BNF) */ bool BinaryFormula(char *s[]) { char *original = *s; SkipWhitespace(s); if (!LeftParen(s)) { *s = original; return false; } if (!BinaryOperator(s)) { *s = original; return false; } if (!Formula(s)) { *s = original; return false; } if (!Formula(s)) { *s = original; return false; } if (!RightParen(s)) { *s = original; return false; } return true; }

Again, rather anti-climatic still... where is all the parsing going on?
As you'll see, each function is very, very simple. All the real work is done in
**SkipWhitespace** and **Match**. Our parsing functions for each expression
just add the “logic” from the BNF.

Study the above and make sure you understand the logic. Each function is checking
the *sequence* of patterns specified in its BNF, in the same order they appear
in the BNF. If any of them fail, they reset **s** and return false. If all of
them succeed, we return true.

Recall that these three expressions in the BNF all contained **Literal Text**,
and the *Constant* rule also contained a **Choice** operator. This will be
reflected in their implementation.

/* Identify a left-parenthesis. Note this is a LITERAL TEXT rule. */ bool LeftParen(char *s[]) { char *original = *s; SkipWhitespace(s); if (match(s, "(")) return true; *s = original; return false; } /* Identify a right-parenthesis. Note this is a LITERAL TEXT rule. */ bool RightParen(char *s[]) { char *original = *s; SkipWhitespace(s); if (match(s, ")")) return true; *s = original; return false; } /* Identify a constant (either 'T' or 'F', that's it). * Note this is a CHOICE rule between two LITERAL TEXT items.*/ bool Constant(char *s[]) { char *original = *s; SkipWhitespace(s); if (match(s, "T") || match(s, "F")) return true; *s = original; return false; }

Each of these functions makes use of the **match** helper function because each of
these rules contains **Literal Text** in their BNF. As we'll see, the only functions
that ever do any actual string matching are the ones corresponding to rules with
literal text in the BNF.

Now let's look at the function for Proposition...

bool Proposition(char *s[]) { char *original = *s; SkipWhitespace(s); /* these are used in sscanf to record what was read from string */ char prop[MAX_PROP_LENGTH]; int numCharsRead; if (sscanf(*s, "%[a-z0-9]%n", &prop, &numCharsRead) == 1) { *s = *s + numCharsRead; return true; } *s = original; return false; }

Here we're skipping all whitespace (as usual) and then using the standard C-library
string function ** sscanf** to do our simple pattern match against

Note that **MAX_PROP_LENGTH** is a global constant I have set to 100.

This is about the most complicated function we define.

Now for the UnaryOperator and BinaryOperator functions. To make my life easy, I have the following global constants defined:

/* some string constants (so i don't have to type these out ever again, and * they can be easily changed) */ const char *STRING_NOT = "\\neg"; const char *STRING_OR = "\\vee"; const char *STRING_AND = "\\wedge"; const char *STRING_IMPLIES = "\\rightarrow"; const char *STRING_IFF = "\\leftrightarrow";

Now for the functions. As stated above, since these rules contain literal text
in the BNF, we'll be using our **match** function.

/* Identify a unary operator (currently only '\neg'). * Note this is a LITERAL TEXT rule. */ bool UnaryOperator(char *s[]) { char *original = *s; SkipWhitespace(s); if (match(s, STRING_NOT)) return true; *s = original; return false; } /* Identify a binary operator (either '\vee', '\wedge', etc.) * Note this is a CHOICE rule among LITERAL TEXT options. */ bool BinaryOperator(char *s[]) { char *original = *s; SkipWhitespace(s); if (match(s, STRING_OR) || match(s, STRING_AND) || match(s, STRING_IMPLIES) || match(s, STRING_IFF)) return true; *s = original; return false; }

*And that's it. Seriously. Well, almost... but what's left is really easy!*

We're almost done. Our function **Formula** will correctly identify if a wff
occurs in the string provided, but there's one problem...

Suppose our string contains a wff followed by some extra stuff. For example, consider the string:

`s="(\leftrightarrow (\neg p) (\vee q T)) :("`

Our **Formula** function will recognize the first part of the string that is a wff and
advance **s** to the first space after it, and return true. Remember the semantics of
our functions.

This is obviously not correct since **s** contains, after a wff, something that makes it
*not a wff*. There are a few ways to fix it. We could update our
BNF, but after writing all of this, I really don't feel like doing that. Instead, we'll
write a “wrapper” function around **Formula** that we'll use as our
top-level wff-checking function, and add the logic to make sure that, if
**Formula** returns true, the rest of the string is empty.

bool FormulaWrapper(char *s[]) { char *original = *s; if (Formula(s)) { SkipWhitespace(s); if (*s[0] == '\0') return true; else { /* there was "junk" after the formula! */ *s = original; return false; } } return false; }

You can view an HTML-ized version of all of the code above here: parser.c.html. It includes a simple driver program that, when run, will repeatedly prompt the user to enter a propositional logic formula (on a single line) up to a 1000 characters long, and each time the program will tell you if what was entered is a wff or not.

Sample output from my program (user input in red):

ryan@barwise> ./parser Enter formula of proposition logic, one per line (max 1000 characters) Each time I'll tell you if it's a wff. Hit CTRL+D to exit. T ===> it's a wff :) p ===> it's a wff :) (\neg p) ===> it's a wff :) (\neg p p) ===> NOT A WFF :( (\vee T T) ===> it's a wff :) (\leftrightarrow (\vee p (\neg q)) (\wedge (\leftrightarrow r s) T)) ===> it's a wff :) (\leftrightarrow (\vee p (\neg q)) (\wedge (\leftrightarrow r s) T)) stuff ===> NOT A WFF :(

As you can see, Recursive Descent Parsing is really quite easy to implement and that is its key strength. Additionally, when you have to extract information during the parsing (as you do with this assignment, more on that in the next section), it's very easy to modify an RDP parser to do this. Here's a short list of the pros/cons of RDP parsers that comes to mind right now:

**PROS of Recursive Descent Parsing:**-
**Quick & Easy to Implement.**It's said you can implement them as fast as you can type them. This really is true,*once you see how they are constructed.***Easy to Extract Information From.**Once you have a RDP parser written, it's fairly easy to update it to actually extract the information you want from the text you're parsing.

**CONS of Recursive Descent Parsing:**-
**No Easy Way to provide Good Error Messages.**Ideally, if a user had a bad expression with, say, an extra open parenthesis, or too many sub-formula in a unary/binary expression, you would like to provide a nice error message, telling them exactly what's wrong. This is either very difficult with RDP's or simply impossible.**RDP isn't always Possible.**As mentioned briefly in the theory section, BNF's can only capture certain types of languages, and RDP only works (easily) for these languages. For more complicated languages (e.g. if you wanted to parse a C/C++ source file), RDP usually isn't possible.

If you're in a situation where RDP isn't appropriate, there are many other options
available for you. For example, you could use a **parser generator** to build
a parser for you. This is a program (or series of programs) that you provide
something like a BNF to, and it actually spits out a chunk of code in C/C++/Java/etc.
that does much of the parsing for you.

One of the more popular parser generators available are Lex and Yacc (*nix
implementations are called Flex and Bison). The only downside to using such tools
is the *complexity*. They are industrial strength parser generators, with
many, many bells and whistles. Learning and using them effectively can take quite
a long time.

For functional languages such as Haskell, there are many options. Most of them are very simple to use and, if you're already good with using languages like Haskell, you can hit-the-ground-running with them.

Suppose you wanted to allow binary operators to use *infix* notation rather
than *prefix* (Polish) notation.
How would go about modifying the parser built above to handle this?

**Hint:** it's exceptionally easy!

Notice we only have to update one rule in our BNF (the rule for *BinaryFormula*).
You then only have to modify the corresponding function **BinaryFormula** in a
rather trivial way. Can you see how?

The above only explains how to *recognize* if a given string contains a wff of
propositional logic, in the format prescribed for the assignment. Your task was to
read in such a formula and then generate its truth table. Now, recognizing that a
given piece of text (from an input file) represents a wff is obviously important,
but how do you *use* this to build some structure in C/C++ (or whatever language)
that you can use/manipulate like a propositional formula in order to build the
truth table?

With a little thought this is quite easy. First, figure out how you're going to store this information internally. Notice that in representing any logical formula, you should be able to represent

- Constants
- Propositions
- Negations of a wff (the sole unary connective)
- And various binary conjunctions of wff's

Once you have a structure capable of representing all of the above, you can then
modify some of the parsing functions to include an additional parameter... say,
a *pointer to your structure.* For example,

bool Formula(char *s[], LogicalFormula *f); bool Constant(char *s[], LogicalFormula *f); bool Proposition(char *s[], LogicalFormula *f); bool UnaryFormula(char *s[], LogicalFormula *f); bool BinaryFormula(char *s[], LogicalFormula *f);

In the case that any of these parse functions return true, and you know you have a wff of the
appropriate type, you then set **f** to point to a new object of the appropriate structure.

I'm not going to go into any more detail than this because at this point, this is something you should know how to do. None the less, if any of you would like to see how I did it, stop by my office hours and I'll be happy to show you (I will NOT, however, post the code online or give you a copy of my code in any form).

October 2009 by Ryan Flannery <ryan.flannery@gmail.com>