While I was on vacation recently, I played around with implementing generative context-free grammars in C++. I have done this in other languages (pure functional ones) and found it a lot easier there. I thought I should to try to write a context-free grammar implementation in C++ that could produce a compile-time sentence. This would of course require a lot of metaprogramming and that would be challenging and new for me, so I decided to do it. I wrote it in C++17, and you’ll see a couple of newer features (constexpr if and variable templates) if you pay close attention.

The goal is this:

  • given a grammar G = (T, N, R, S) = (terminals, nonterminals, rules, start symbol)
  • and a random seed r
  • expand S into a sentence (string of symbols with no nonterminals)
  • and provide a compile-time string (i.e. const char *) representation of that sentence

When I approach a task like this I usually like to start by working backward from the syntax I would like to see. This is especially important for me in C++ because the notation can quickly grow verbose and become pretty dissatisfying. Besides, having a view of my destination is like having a postcard from a friend that makes you wish you were on some tropical island with them.

This is the notation I liked:

Grammar<
    S,
    Rule<S, A>,
    Rule<A, A, B>,
    Rule<A, a>,
    Rule<B, b>
> g;

There are some implications here, namely that the first template parameter of the grammar is the start symbol and that the first template parameter of each rule is its left-hand side. These seem obvious to me and seemed to seem obvious to anyone familiar with grammars so I decided not to complicate the syntax by introducing extra layers of nesting.

The full code can be found here. I’d like to explain a couple of the more interesting points.

As a convention that some use in C++ and that I also enjoy from prolog, I use mostly single-letter capital letters for parameter names and add an ‘s’ when it is a parameter pack. Also, when I treat a struct as a metaprogramming function, I name the type that represents the output of that function Result.

Utility templates

I defined a couple utility templates to make things easier. First, a type list and some operations on lists:

template<typename... Ts> struct List {
    constexpr static auto Size = sizeof...(Ts);
};

template<typename...> struct ListCat;
template<typename... Ls, typename ... Rs>
struct ListCat<List<Ls...>, List<Rs...>> { using Result = List<Ls..., Rs...>; };

template<typename, typename> struct ListAppend;
template<typename Head, typename... Tail>
struct ListAppend<Head, List<Tail...>> { using Result = List<Head, Tail...>; };

Typically, I store lists of rules and symbols in List types.

I also used a recurisve Select template to choose the nth element from a list:

template<size_t, typename> struct Select;

template<size_t I, typename Head, typename... Tail>
struct Select<I, List<Head, Tail...>> { using Result = typename Select<I - 1, List<Tail...>>::Result; };

template<typename Head, typename... Tail>
struct Select<0, List<Head, Tail...>> { using Result = Head; };

This comes in handy for rule selection.

Symbol types

Symbols (nonterminals and terminals) are structs with the following static members:

struct FooSymbol {
    static constexpr const char * Name = "Foo";
    static constexpr bool IsTerminal = true;
};

Name is a string representation of the symbol (for instance Plus::Name = “+”). IsTerminal is used only to indicate whether the symbol is a terminal or nonterminal.

If I were interested in being very robust, I would instead (or in addition) provide a way to indicate name and terminal-ness via template specializations. For instance:

template<>
constexpr const char * SymbolName<FooSymbol> = "Foo";

template<>
constexpr bool SymbolIsTerminal<FooSymbol> = true;

This would make it easy to hook the grammar system into arbitrary type systems, for instance with third-party types that can’t be modified.

Rule type

For now, I represent rules this way:

template<typename L, typename... Rs>
struct Rule {
    using Lhs = L;
    using Rhs = List<Rs...>;
};

Later in this post I’ll be adding weights to my rule concept.

Rule application

To generate the final sentence, I start with the initial symbol and expand recusively until the resulting string comprises only terminal symbols. To avoid extra template instantiation that happens with std::conditional, I select the template based on a boolean template parameter.

// Expand entire string repeatedly until all terminal.
template<size_t, typename, typename, bool> struct ExpandImpl;

// Base case - done expanding
template<size_t InI, typename ExpList, typename RList>
struct ExpandImpl<InI, ExpList, RList, true> { using Result = ExpList; };

template<size_t InI, typename ... Ss, typename RList>
struct ExpandImpl<InI, List<Ss...>, RList, false> {
    using Expanded = ExpandOneSent<InI, List<>, List<Ss...>, RList>;
    using PrevResult = typename Expanded::Result;
    constexpr static auto I = Expanded::I;
    using Result = typename ExpandImpl<I, PrevResult, RList,
                                       IsAllTerminal<PrevResult>::Result || (sizeof...(Ss) > 100)>::Result;
};

The parameters in order - first, a size_t RNG state variable for selecting rules. Then the list of symbols to expand. Third, the rule list. Then the boolean tag that represents whether we’re at the base case.

The non-base case calls down to yet another template (ExpandOneSent) that’s designed to expand a sentential form once. The new sentential form is stored in PrevResult. The ExpandOneSent struct also contains a static member, I, which is the new RNG state. IsAllTerminal is a utility template that checks whether all the types in the list parameter are terminals or not.

ExpandOneSent looks like this:

// Expand an entire list once.
template<size_t, typename, typename, typename> struct ExpandOneSent;

// Base case - nothing left to expand
template<size_t InI, typename ExpList, typename RList>
struct ExpandOneSent<InI, ExpList, List<>, RList> {
    constexpr static auto I = InI;
    using Result = ExpList;
};

// Expand Next and place its expansion in List<Prev..., _>
template<size_t InI, typename... Prev, typename Next, typename ...Tail, typename RList>
struct ExpandOneSent<InI, List<Prev...>, List<Next, Tail...>, RList>
{
    using ExpandedNext = ExpandSymbol<InI, Next, RList, Next::IsTerminal>;
    constexpr static auto NextI = ExpandedNext::NextI;
    using NewHead = typename ListCat<List<Prev...>, typename ExpandedNext::Result>::Result;
    using ExpandedTail = ExpandOneSent<NextI, NewHead, List<Tail...>, RList>;
    using Result = typename ExpandedTail::Result;
    constexpr static auto I = ExpandedTail::I;
};

Again, implemented recursively with a separate base case. This time the base case can be selected easily pattern matching on the template arguments alone, so there is no need for a separate boolean template parameter.

ExpandOneSent takes an RNG state, an input list, and the rule list, and places its output in the second template parameter (List<Prev...> or ExpList). In each recursive pass it expands one symbol (ExpandSymbol) from the front of the input list and concatenates the result with the output from the previous step.

You will notice that I always try to expand the symbol regardless of whether it’s a terminal or nonterminal. I leave that selection up to ExpandSymbol itself. Here’s the implementation:

// Expand a single non-terminal, or don't if it's a terminal
template<size_t, typename, typename, bool> struct ExpandSymbol;

// I = RNG value, S = symbol to expand, RList = list of rules
template<size_t I, typename S, typename RList>
struct ExpandSymbol<I, S, RList, false> {
    using Matches = typename MatchingRules<S, RList>::Result;
    using Result = typename Select<I % Matches::Size, Matches>::Result::Rhs;
    constexpr static auto NextI = CalcNextI(I);
};

template<size_t I, typename S, typename Rs>
struct ExpandSymbol<I, S, Rs, true> {
    using Result = List<S>;
    constexpr static auto NextI = I;
};

Here is the actual use of the RNG state. I do a naive thing and select the rule with RNG mod \#rules. In order to select the application rule, I first collect all the rules that share a common left-hand side with MatchingRules, then use my Select utility template to choose among them.

The nonterminal case of ExpandSymbol is the only spot where I consume RNG state, so it’s also the only place where I alter it. CalcNextI is a simple xorshift algorithm.

MatchingRules is yet another recursive template, a bit simpler than the others.

// base case chosen when list is empty
template<typename S, typename RList> struct MatchingRules { using Result = RList; };
template<typename S, typename R, typename...Rs>
struct MatchingRules<S, List<R, Rs...>> {
    using Tail = typename MatchingRules<S, List<Rs...>>::Result;
    using Result = std::conditional_t<std::is_same_v<S, typename R::Lhs>,
          typename ListAppend<R, Tail>::Result, Tail>;
};

In this case I went ahead and used std::conditional even though it generates extra unused type lists. In the interest of brevity.

I also created a helper template used by the actual Grammar type to produce a sentence:

// Helper template for Expanding
template<size_t I, typename S, typename RList>
struct Expand {
    using Result = typename ExpandImpl<I, List<S>, RList, S::IsTerminal>::Result;
};

To recap, the hierarchy of templates is:

  • Expand - helper interface to ExpandImpl
  • ExpandImpl - recursively applies rules until no more nonterminals remain
  • ExpandOneSent - expands a single sentential form once
  • ExpandSymbol - expands a single symbol, choosing psuedorandomly among its possible productions
  • MatchingRules - collects all the matching rules for a given left-hand side nonterminal

Now we have a way to generate a sentence from a grammar and a starting seed:

template<size_t I, typename S, typename...Rs>
using Sentence = typename Expand<I, S, List<Rs...>>::Result;

But, we still need to turn this sentence into a compile-time string literal.

String literal concatenation

Now, the goal is to take the Name member of each of the symbol types in the result sentence and concatenate them into a single string.

If I were going to write ConcatImpl as a runtime function, it might look like this:

// assume `out` points to storage large enough to hold the contents of `next.name`.
void ConcatImpl(const char * out, const Symbol & next) {
    auto len = std::strlen(next.name);
    for (size_t i = 0; i < len; ++i)
        out[i] = next.name[i];
    out[len] = '\0';
}

and we could call it like this:

// assume `out` has enough storage to hold the contents of all the strings in `sentence`
void Concat(const char * out, const SymbolList & sentence) {
    for (const auto & sym : sentence) {
        ConcatImpl(out, sym);
        out += std::strlen(sym.name);
    }
}

However, this won’t do for a compile-time type function. We need to call ConcatImpl recursively rather than iteratively, so its second parameter should be a SymbolList. We cannot use strlen because it is not constexpr in C++17, and probably will never be (but we can simply write our own). We cannot use loops to fill storage, because constexpr static strings must be initialized inline in a single brace-enclosed list.

The last of these restrictions made this problem more complicated than I expected, but thanks to a helpful stack overflow answer I found a decent working solution. The basic technique is to use a recursive template, each instantiation of which uses a pair of integer sequence (std::integer_sequence) parameter packs to stitch together a static string, appending a new symbol’s name each time.

As with Expand, this template comes with a helper interface which instantiates the real implementation. I call them Concat and ConcatImpl.

This time I’m going to show the recursive case first:

template<const char *, typename ...> struct ConcatImpl;
template<const char * Str, size_t... LhsI, size_t... RhsI, typename Sym, typename Sym2, typename ...  Syms>
struct ConcatImpl<Str, ISeq<LhsI...>, ISeq<RhsI...>, Sym, Sym2, Syms...> {
    constexpr static size_t Size = sizeof...(LhsI) + sizeof...(RhsI);
    constexpr static const char Partial[Size] = {Str[LhsI]..., Sym::Name[RhsI]... };
    constexpr static const char * String = ConcatImpl<
        Partial, MakeISeq<Size>, MakeISeq<Strlen(Sym2::Name)>, Sym2, Syms...>::String;
};

Note that I have this alias earlier in my implementation:

template<size_t... I> using ISeq = std::index_sequence<I...>;

There’s a lot going on here so I’m going to isolate the template specialization parameter list:

<Str, ISeq<LhsI...>, ISeq<RhsI...>, Sym, Sym2, Syms...>

Here’s how all the pieces are related:

  • Str is the input string from the previous invocation. it contains the joined names of all symbols consumed so far
  • Sym, Sym2, Syms... is the queue of symbols to be processed. Syms is the front element, and in this invocation of ConcatImpl its Name will be appended to Str
  • ISeq<LhsI...> is a sequence of indices from 0 to strlen(Str) - 1 inclusive
  • ISeq<RhsI...> is a sequence of indices from 0 to strlen(Sym::Name) - 1 inclusive

the only way (that I could find) to satisfy the inline brace-enclosed initializer list requirement is by creating these two parameter packs and expanding them inside the braces. luckily, that works.

for example, suppose Str = "Foo" and Sym::Name = "Bar". Then Partial[Size] = {Str[LhsI]..., Sym::Name[RhsI]... }; would be evaluated as:

Partial[Size] = {Str[LhsI]...,           Sym::Name[RhsI]...                       } ;
Partial[Size] = {Str[0], Str[1], Str[2], Sym::Name[0], Sym::Name[1], Sym::Name[2] } ;
Partial[6]    = {'F'   , 'o'   , 'o'   , 'B'         , 'a'         , 'r'          } ;
Partial[6]    = "FooBar";

Note that we don’t care about adding a null terminator until the base case.

One of my biggest frustrations in c++ metaprogramming so far is being unable to give names to parameter packs except by capturing them via class template specializations. In ConcatImpl, the lengths of Str and Sym::Name don’t really need to be known at invocation time. But because of this limitation on parameter packs, our only options are to pass them in, or to defer the work to yet another helper template.

At the end of ConcatImpl, the recursive invocation passes the result string, an integer sequence for the result string, one for the next symbol, and the tail of the symbol list:

    constexpr static const char * String = ConcatImpl<
        Partial, MakeISeq<Size>, MakeISeq<Strlen(Sym2::Name)>, Sym2, Syms...>::String;

MakeISeq is an alias to std::make_index_sequence, and Strlen is simply a constexpr-ified std::strlen.

Here is the base case for ConcatImpl, which adds the final null terminator:

template<const char * Str, size_t... LhsI, size_t... RhsI, typename Sym>
struct ConcatImpl<Str, ISeq<LhsI...>, ISeq<RhsI...>, Sym> {
    constexpr static size_t Size = sizeof...(LhsI) + sizeof...(RhsI);
    constexpr static const char String[Size + 1] = {Str[LhsI]..., Sym::Name[RhsI]..., '\0' };
};

Finally, the implementation of Concat itself:

// Strings together all the names of the symbols in Ss into a single const char array
template<typename> struct Concat;
template<typename S, typename ... Ss>
struct Concat<List<S, Ss...>> {
    constexpr static auto String = ConcatImpl<
        nullptr, MakeISeq<0>, MakeISeq<Strlen(S::Name)>, S, Ss...>::String;
};

We can pass nullptr as the first argument to ConcatImpl since it is never read.

Now, we’re ready to write the top-level interface to the generator:

template<size_t I, typename S, typename... Rs>
struct ProduceImpl
{
    using Result = typename Expand<I, S, List<Rs...>>::Result;
    constexpr static const char * String = Concat<Result>::String;
};

template<typename S, typename ... Rs>
struct Grammar {
    template<size_t I>
    static constexpr auto Production = ProduceImpl<I, S, Rs...>::String;
};

Usage

Using the grammar is quite easy:

using G = Grammar<
    S,
    /* ... */
    >;

void print() {
    std::cout << G::template Production<3> << std::endl;
}

In my example code in the linked repository, I created a helper function to print the result of Production for all the random seeds between 1 and a given argument (seed 0 produces an infinite sequence of 0s with my RNG). I created a toy grammar, and printed out the first 5 results:

// A-E are nonterminals, a-e are terminals
using Toy = Grammar<
    A,
    Rule<A, A, B, C, D, E>,
    Rule<A, C, D, E>,
    Rule<B, b>,
    Rule<B, B, E, C>,
    Rule<C, c>,
    Rule<C, c, d, c, d, c>,
    Rule<C, D, E>,
    Rule<D, d>,
    Rule<D, B, E>,
    Rule<E, e>
>;

int main() {
    std::cout << "toy:\n";
    print<Toy, 5>();
}
[brianheim@BrianMBP constexpr_grammar]$ g++-8 -o main main.cpp -std=c++17
[brianheim@BrianMBP constexpr_grammar]$ ./main
toy:
1: cdcdcbebecdcdcedeecdcdceeee
2: debeebcdebebedeeceeebeecdcdcde
3: cde
4: cdcdcdebedeecdebeebecdcdcedebeedebcdebecdcdccdebcbee
5: debee

Adding weighted rules

The grammar implementation I’ve shown so far is pretty limited by the rule representation. All the generation algorithm can do is select from the possible productions with equal probability. It would be nice to have weights associated with each rule. Since we can’t use floating-point numbers as template arguments, we’ll use integer weights and implicitly make the probability for each rule the ratio between its weight and the sum of weights for all possible productions.

template<size_t W, typename L, typename... Rs>
struct WRule {
    using Lhs = L;
    using Rhs = List<Rs...>;
    constexpr static auto Weight = W;
};

// normal rules are just 1-weighted rules
template<typename L, typename... Rs>
using Rule = WRule<1, L, Rs...>;

Now, we need two more templates: one to calculate the sum of the weights in a list, the other to do weighted selection.

The first one is easy enough to write, if we use a fold expression:

template<typename> struct WeightSum;

template<typename ... Rs>
struct WeightSum<List<Rs...>> {
    constexpr static auto Result = (0 + ... + Rs::Weight);
};

Weighted selection is somewhat harder. I implemented the main part as a constexpr function which returns the index of the appropriate rule, which I can then feed back into my earlier Select utility template:

template<size_t Idx, size_t WeightIdx, typename H, typename... Ts>
constexpr size_t WSelectImpl() {
    if constexpr (WeightIdx < H::Weight)
        return Idx;
    else
        return WSelectImpl<Idx + 1, WeightIdx - H::Weight, Ts...>();
}

template<size_t, typename> struct WSelect;
template<size_t I, typename... Ts>
struct WSelect<I, List<Ts...>> {
    using Result = typename Select<WSelectImpl<0, I, Ts...>(), List<Ts...>>::Result;
};

My initial implementation was a recursive template like Select, but using std::conditional because I couldn’t think of another way to determine the base case. This would result in a couple extra template instantiations than with Select, because std::conditional would expand the false branch even after the matching index was reached. I didn’t notice either approach to be noticeably faster than the other when I benchmarked them.

Usage

In the repository I linked above, there’s a example grammar using weighted rules. It’s the classic math expression grammar that anyone who’s taken a class covering parsers has probably seen:

using WeightedMath = Grammar<
    Expr,

    WRule<6, Expr, Term>,                   // E -> T
    WRule<2, Expr, Expr, Plus, Term>,       // E -> E + T
    WRule<2, Expr, Expr, Minus, Term>,      // E -> E - T

    WRule<6, Term, Factor>,                 // T -> F
    WRule<2, Term, Term, Times, Factor>,    // T -> T * F
    WRule<2, Term, Term, Div, Factor>,      // T -> T / F

    WRule<8, Factor, Digits>,               // F -> [0-9]+
    WRule<2, Factor, LParen, Expr, RParen>, // F -> ( E )

    WRule<4, Digits, Digit>,
    WRule<6, Digits, Digits, Digit>,

    WRule<0, Digit, n0>,
    WRule<0, Digit, n1>,
    WRule<1, Digit, n2>,
    WRule<1, Digit, n3>,
    WRule<1, Digit, n4>,
    WRule<0, Digit, n5>,
    WRule<1, Digit, n6>,
    WRule<0, Digit, n7>,
    WRule<1, Digit, n8>,
    WRule<1, Digit, n9>
>;

And some example productions:

int main() {
    std::cout << "math:\n";
    print<WeightedMath, 15>();
}
[brianheim@BrianMBP constexpr_grammar]$ g++-8 -o main main.cpp -std=c++17
[brianheim@BrianMBP constexpr_grammar]$ ./main
math:
1: 48/62683/3/88*638*3*(92639-28882)
2: 2842
3: 38
4: 63294392668/((9/8883*26))
5: 83384
6: 486+42
7: (994)*84+6/684-68236+29*323*3
8: 49689+44-9+866-823
9: 89-228224
10: 2
11: 962*((6/9/(9-936889936/242494+4)*446+(3/4+3))*69*(3*6482+88868-6-893-88-(936*6)+844))
12: 6
13: 68
14: (468-(2)+(998*9*8+388*((8283289)/(692/(849883)/98)/68*8)-3393939489)/4484/89832)/(86)
15: 2

Conclusion

So there you have it, a compile-time, context-free grammar implementation in C++17.

As I mentioned above, I’ve posted the full code on github here. I’d welcome any feedback (you can reach me via personal email) since this is my first larger metaprogramming project.

C u soon