Stork, Part 2: Developing an Expression Parser

This installment of our series delves into a challenging aspect of scripting language development: creating an expression parser, a fundamental component of any programming language.

One might wonder: Why not leverage existing tools or libraries?

Why not employ Lex, YACC, Bison, Boost Spirit, or even regular expressions?

My wife has observed a peculiar habit of mine: I tend to enumerate reasons when confronted with tough questions, perhaps hoping quantity will compensate for any lack of quality in my response.

Let’s apply this approach now.

  1. I aim to utilize standard C++, specifically C++17. The language itself is quite comprehensive, and I’m resisting the temptation to incorporate anything beyond the standard library.
  2. During the development of my first scripting language, time constraints and a lack of familiarity with tools like Lex or YACC led me to implement everything manually.
  3. Subsequently, I stumbled upon a blog post by a seasoned programming language developer who cautioned against relying on such tools. He argued that these tools address the simpler aspects of language development, ultimately leaving you to grapple with the more intricate challenges. Regrettably, I cannot locate that particular blog post now—it was a time when both the internet and I were in our youth.
  4. Back then, a prevalent meme stated: “You have a problem, and you decide to solve it with regular expressions. Now you have two problems.”
  5. While I am acquainted with Boost Spirit and acknowledge its utility as a library, I still prefer to maintain finer control over the parser.
  6. I derive satisfaction from manually implementing these components. Honestly, I could simply offer this explanation and dispense with the list entirely.

The complete source code is available at my GitHub page.

Token Iterator

This section outlines some modifications made to the code from Part 1.

These changes primarily consist of minor adjustments and fixes. The one notable addition to the existing parsing code is the tokens_iterator class, which lets us examine the current token without extracting it from the stream — a convenience we will rely on throughout the parser.

class tokens_iterator {
private:
  push_back_stream& _stream;
  token _current;
public:
  tokens_iterator(push_back_stream& stream);

  const token& operator*() const;
  const token* operator->() const;

  tokens_iterator& operator++();

  explicit operator bool() const;
};

The iterator is initialized with a push_back_stream, after which it can be dereferenced and incremented. The explicit bool conversion evaluates to false when the current token is end-of-file (eof), so it can be used to test whether any input remains.
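The peek-without-consume pattern can be sketched independently of Stork's token type. The following is an illustrative stand-in (names such as demo_tokens_iterator are hypothetical, not Stork's actual code) where tokens are plain strings and running past the end yields an empty string acting as eof:

```cpp
#include <string>
#include <vector>

// Minimal stand-in for Stork's tokens_iterator: caches a position in a
// token source and lets callers peek at the current token via operator*
// without consuming it. An empty string plays the role of the eof token.
class demo_tokens_iterator {
private:
    const std::vector<std::string>& _tokens;
    size_t _idx = 0;
public:
    explicit demo_tokens_iterator(const std::vector<std::string>& tokens)
        : _tokens(tokens) {}

    const std::string& operator*() const {
        static const std::string eof;
        return _idx < _tokens.size() ? _tokens[_idx] : eof;
    }

    const std::string* operator->() const {
        return &**this;
    }

    demo_tokens_iterator& operator++() {
        ++_idx;
        return *this;
    }

    // false once the current token is end-of-file
    explicit operator bool() const {
        return _idx < _tokens.size();
    }
};
```

Calling code can then branch on the current token and only advance when it decides to accept it, which is exactly what makes peeking convenient in a recursive parser.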

Statically or Dynamically Typed?

This part necessitates a crucial decision: Will the language be statically or dynamically typed?

Statically typed languages enforce type checking during compilation, whereas dynamically typed languages defer this process to runtime. Consequently, potential errors in dynamically typed languages may lurk undetected until execution.

Statically typed languages clearly hold an advantage in this regard, as early error detection is always desirable. The primary advantage of dynamically typed languages, in my view, lies in the ease of development.

My previous scripting language employed dynamic typing and proved relatively straightforward to implement, including the expression parser. The absence of explicit type checking simplifies the process, relying instead on runtime errors that are relatively easy to handle.

For instance, implementing the binary + operator for numerical addition only requires evaluating both operands as numbers at runtime. If either operand fails this evaluation, an exception is thrown. In my previous implementation, I even incorporated operator overloading by inspecting the runtime type information of variables and operators within expressions.
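To make the contrast concrete, here is a hedged sketch (not Stork code) of how a dynamically typed + could work: a value is a variant, both operands are evaluated as numbers at runtime, and failure throws.

```cpp
#include <stdexcept>
#include <string>
#include <variant>

// Illustrative only: in a dynamically typed language a value carries its
// type at runtime, and the binary + operator checks types when it runs
// rather than at compile time.
using dyn_value = std::variant<double, std::string>;

double as_number(const dyn_value& v) {
    if (const double* d = std::get_if<double>(&v)) {
        return *d;
    }
    throw std::runtime_error("operand is not a number");
}

dyn_value dynamic_add(const dyn_value& x, const dyn_value& y) {
    // Any type error surfaces here, at runtime, as an exception.
    return as_number(x) + as_number(y);
}
```

A statically typed compiler would instead reject the bad call before the program ever runs.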

My initial foray into language development (this being my third attempt) involved type checking during compilation, but it didn’t fully exploit the benefits. Expression evaluation still relied on runtime type information.

For this iteration, I’ve opted for static typing, which has proven to be more challenging than anticipated. However, since I’m not aiming for binary compilation or any form of emulated assembly code, some type information will inherently persist in the compiled output.

Types

Our basic type system will encompass the following:

  • Numbers
  • Strings
  • Functions
  • Arrays

While we may expand this set later, structures (or classes, records, tuples, etc.) will be omitted initially. However, our design will accommodate future extensions, avoiding choices that would make such additions difficult.

Representing types as strings with parsing conventions seemed cumbersome. Implementing them as classes with shared pointers felt excessive.

Ultimately, I settled on a type registry:

using type = std::variant<simple_type, array_type, function_type>;
using type_handle = const type*;

Basic types are defined within an enum:

enum struct simple_type {
  nothing,
  number,
  string,
};

The nothing enum member serves as a stand-in for void, which is a reserved keyword in C++.

Array types are represented by a structure containing a type_handle member:

struct array_type {
  type_handle inner_type_id;
};

Note that array lengths are not included in array_type, implying dynamic resizing. We’ll address this using std::deque or a similar mechanism later.

Function types consist of the return type, parameter types, and whether parameters are passed by value or reference:

struct function_type {
  struct param {
    type_handle type_id;
    bool by_ref;
  };
  type_handle return_type_id;
  std::vector<param> param_type_id;
};

Now, let’s define the class responsible for managing these types:

class type_registry {
private:
  struct types_less{
    bool operator()(const type& t1, const type& t2) const;
  };
  std::set<type, types_less> _types;

  static type void_type;
  static type number_type;
  static type string_type;
public:
  type_registry();

  type_handle get_handle(const type& t);

  static type_handle get_void_handle() {
    return &void_type;
  }

  static type_handle get_number_handle() {
    return &number_type;
  }

  static type_handle get_string_handle() {
    return &string_type;
  }
};

Types are stored in a std::set to ensure pointer stability even when new types are inserted. The get_handle function registers a given type and returns a pointer to it. If the type is already present, it returns a pointer to the existing entry. Convenience functions are provided for retrieving primitive types.

Regarding implicit conversions, numbers will be convertible to strings. This should be safe, as the reverse conversion is not permitted, and the string concatenation operator (..) is distinct from the numerical addition operator. Even if this conversion is later removed, it serves as a valuable exercise. It did require modifying the number parser, since a . character may now begin the .. concatenation operator rather than always marking a decimal point.

Compiler Context

During compilation, various functions require access to information about the code being processed. We’ll store this information in a compiler_context class. For expression parsing, we’ll need to retrieve details about encountered identifiers.

Runtime variables will reside in two containers: a global variable container and a stack. The stack expands upon function calls and scope entry, and it contracts upon function returns and scope exit. Function parameters are pushed onto the stack during calls, and the return value is pushed upon function completion. Therefore, for each function, the stack structure will resemble this:

[Diagram: per-function stack layout — the function's return value slot, followed by its parameters and local variables, all addressed relative to the return value's absolute index]

Functions will track the absolute index of their return variable, and variables or parameters will be accessed relative to this index.

Currently, functions are treated as constant global identifiers.

Here’s the class definition for identifier information:

class identifier_info {
private:
  type_handle _type_id;
  size_t _index;
  bool _is_global;
  bool _is_constant;
public:
  identifier_info(type_handle type_id, size_t index,
                  bool is_global, bool is_constant);

  type_handle type_id() const;

  size_t index() const;

  bool is_global() const;

  bool is_constant() const;
};

For local variables and function parameters, index returns the index relative to the return value. For global identifiers, it returns the absolute global index.

Three distinct identifier lookup mechanisms will be available in compiler_context:

  1. Global identifier lookup, stored by value within compiler_context as it remains constant throughout compilation.
  2. Local identifier lookup, represented by a unique_ptr, which is initialized to nullptr in the global scope and within functions. Upon entering a scope, a new local context is created with the previous context as its parent. When exiting a scope, the current context is replaced by its parent.
  3. Function identifier lookup, using a raw pointer. This pointer is nullptr in the global scope and points to the outermost local scope within a function.
class compiler_context {
private:
  global_identifier_lookup _globals;
  function_identifier_lookup* _params;
  std::unique_ptr<local_identifier_lookup> _locals;
  type_registry _types;
public:
  compiler_context();
  type_handle get_handle(const type& t);
  const identifier_info* find(const std::string& name) const;
  const identifier_info* create_identifier(std::string name, 
    type_handle type_id, bool is_constant);
  const identifier_info* create_param(std::string name,
    type_handle type_id);
  void enter_scope();
  void enter_function();
  bool leave_scope();
};
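The scope chain described in point 2 can be sketched on its own. This is a simplified model (identifiers are just names mapped to stack indices, and the demo_ names are hypothetical), but the ownership structure mirrors the description: each lookup owns its parent via unique_ptr.

```cpp
#include <map>
#include <memory>
#include <string>

// Each scope owns its enclosing scope; entering a scope pushes a new
// lookup in front, leaving a scope pops back to the parent.
struct demo_local_lookup {
    std::map<std::string, size_t> identifiers;
    std::unique_ptr<demo_local_lookup> parent;
};

class demo_scope_chain {
private:
    std::unique_ptr<demo_local_lookup> _locals;
public:
    void enter_scope() {
        auto inner = std::make_unique<demo_local_lookup>();
        inner->parent = std::move(_locals);
        _locals = std::move(inner);
    }

    bool leave_scope() {
        if (!_locals) {
            return false;
        }
        _locals = std::move(_locals->parent);
        return true;
    }

    void declare(const std::string& name, size_t index) {
        _locals->identifiers[name] = index;
    }

    const size_t* find(const std::string& name) const {
        // Walk outward through enclosing scopes; the real compiler would
        // fall back to the global lookup after this loop.
        for (const demo_local_lookup* l = _locals.get(); l; l = l->parent.get()) {
            auto it = l->identifiers.find(name);
            if (it != l->identifiers.end()) {
                return &it->second;
            }
        }
        return nullptr;
    }
};
```

Shadowing falls out naturally: the innermost scope is searched first, and popping a scope destroys everything declared in it.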

Expression Tree

Parsed expression tokens are transformed into an expression tree, which, like any tree, comprises nodes.

Two types of nodes exist:

  1. Leaf nodes, representing: a) Identifiers b) Numbers c) Strings
  2. Inner nodes, representing operations on child nodes, storing children as unique_ptr.

Each node carries information about its type and whether it yields an lvalue (a value that can appear on the left side of the = operator).

Inner node creation involves attempting to convert the return types of child nodes to the expected types. Permitted implicit conversions include:

  1. lvalue to non-lvalue
  2. Any type to void
  3. number to string

If a conversion is disallowed, a semantic error is thrown.
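The three rules above can be expressed as a small predicate. This is a sketch under simplifying assumptions (types reduced to an enum rather than type_handle values, and I assume conversions to an lvalue require an identical lvalue source, which the article does not spell out):

```cpp
// Reduced model of the implicit-conversion check performed when building
// inner nodes; in Stork the types would be type_handle values.
enum struct t { void_t, number, string };

struct expr_type {
    t type_id;
    bool lvalue;
};

bool is_convertible(expr_type from, expr_type to) {
    if (to.lvalue) {
        // Assumption: an lvalue target needs an lvalue source of the
        // exact same type; lvalue-ness is never gained by conversion.
        return from.lvalue && from.type_id == to.type_id;
    }
    if (to.type_id == t::void_t) {
        return true;  // rule 2: any type converts to void
    }
    if (from.type_id == to.type_id) {
        return true;  // same type; rule 1 lets the lvalue-ness drop away
    }
    return from.type_id == t::number && to.type_id == t::string;  // rule 3
}
```

Anything falling through these cases would trigger the semantic error.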

Here’s the class definition, omitting some helper functions and implementation details:

enum struct node_operation {
  ...
};

using node_value = std::variant<
  node_operation,
  std::string,
  double,
  identifier
>;

struct node {
private:
  node_value _value;
  std::vector<node_ptr> _children;
  type_handle _type_id;
  bool _lvalue;
public:
  node(
    compiler_context& context,
    node_value value,
    std::vector<node_ptr> children
  );

  const node_value& get_value() const;
  const std::vector<node_ptr>& get_children() const;
  type_handle get_type_id() const;
  bool is_lvalue() const;
  void check_conversion(type_handle type_id, bool lvalue);
};

Most method functionalities are self-explanatory. The check_conversion function verifies if a type can be converted to the provided type_id and lvalue flag according to the conversion rules, throwing an exception if the conversion is not permitted.

Nodes initialized with std::string or double are assigned types string or number respectively and are not considered lvalues. Nodes initialized with identifiers inherit the identifier’s type and are considered lvalues unless the identifier is constant.
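The leaf-typing rule for literals can be shown in miniature (a reduced model with hypothetical names, not Stork's node class; identifier leaves, which copy the identifier's type and may be lvalues, are omitted):

```cpp
#include <string>
#include <variant>

// A node built from a double is typed number, one built from a
// std::string is typed string; neither literal is an lvalue.
enum struct leaf_type { number, string };

struct leaf_info {
    leaf_type type_id;
    bool lvalue;
};

using leaf_value = std::variant<double, std::string>;

leaf_info type_of_leaf(const leaf_value& v) {
    if (std::holds_alternative<double>(v)) {
        return {leaf_type::number, false};
    }
    return {leaf_type::string, false};
}
```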

Nodes initialized with node operations have their types determined by the operation and, in some cases, the types of their children. Let’s summarize expression types in a table. An & suffix denotes an lvalue. Parentheses group expressions treated identically.

Unary Operations

  Operation        Operation type   x type
  ++x (--x)        number&          number&
  x++ (x--)        number           number&
  +x (-x ~x !x)    number           number

Binary Operations

  Operation                                                    Operation type     x type               y type
  x+y (x-y x*y x/y x\y x%y x&y x|y x^y x<<y x>>y x&&y x||y)    number             number               number
  x==y (x!=y x<y x>y x<=y x>=y)                                number             number or string     same as x
  x..y                                                         string             string               string
  x=y                                                          same as x          lvalue of anything   same as x, without lvalue
  x+=y (x-=y x*=y x/=y x\=y x%=y x&=y x|=y x^=y x<<=y x>>=y)   number&            number&              number
  x..=y                                                        string&            string&              string
  x,y                                                          same as y          void                 anything
  x[y]                                                         element type of x  array type           number

Ternary Operations

  Operation   Operation type   x type   y type     z type
  x?y:z       same as y        number   anything   same as y

Function Call

Function calls introduce additional complexity. For a function with N arguments, the function call operation has N+1 children. The first child represents the function itself, and the remaining children correspond to the arguments.

Implicit passing of arguments by reference is disallowed. Callers must explicitly use the & prefix. Currently, this is not treated as a separate operator but rather as part of the function call parsing logic. If an ampersand is expected but not parsed, the lvalue-ness of the argument is removed by introducing a dummy unary operation called node_operation::param. This operation preserves the child’s type but removes its lvalue property.

During node construction for a function call, an error is generated if an lvalue argument is encountered when the function doesn’t expect one, indicating an accidental ampersand. Interestingly, if treated as an operator, & would have the lowest precedence, as it lacks semantic meaning when parsed within an expression. This might change later.

Expression Parser

Edsger Dijkstra, a prominent figure in computer science, once remarked:

“It is practically impossible to teach good programming to students that have had a prior exposure to BASIC. As potential programmers, they are mentally mutilated beyond hope of regeneration.”

Those who evaded BASIC’s influence should consider themselves fortunate, having escaped “mental mutilation.”

For the rest of us, the “mutilated” programmers, let’s reminisce about our BASIC coding days. BASIC had the \ operator for integer division, which was quite useful in a language lacking separate integer and floating-point types. I’ve incorporated this operator into Stork, along with \=, which performs integer division followed by assignment.

I suspect JavaScript programmers, for instance, would appreciate such operators. One can only speculate what Dijkstra would say about them if he had witnessed the rise of JavaScript.

Speaking of JavaScript, one of my biggest gripes is the inconsistency in these expressions:

  • "1" - "1" evaluates to 0
  • "1" * "1" evaluates to 1
  • "1" / "1" evaluates to 1
  • "1" + "1" evaluates to "11"

The Croatian hip-hop group “Tram 11,” named after a tram route in Zagreb, had a song with lyrics roughly translating to: “One and one are not two, but 11.” This song predates JavaScript but aptly illustrates the discrepancy.

To circumvent such issues, I’ve prohibited implicit conversions from strings to numbers and introduced the .. operator for concatenation (and ..= for concatenation with assignment). The origin of this operator eludes me—it’s not from BASIC or PHP. I’d rather not Google “Python concatenation,” as Python-related searches tend to give me the creeps. My phobia of snakes combined with “concatenation” is not a pleasant thought. Perhaps I’ll explore this further later using an archaic text-based browser devoid of ASCII art.

Returning to the main topic, our expression parser will be an adaptation of the “Shunting-yard Algorithm.”

This algorithm is relatively straightforward and likely occurs to most programmers after some thought. It can evaluate infix expressions directly or convert them to reverse Polish notation, though the latter is not our goal here.

The algorithm reads operands and operators sequentially from left to right. Operands are pushed onto an operand stack. Operators, however, cannot be evaluated immediately due to precedence rules. They are pushed onto an operator stack.

Before pushing an operator, the algorithm checks if the operator at the top of the stack has higher precedence than the newly read operator. If so, the top operator is evaluated using operands from the operand stack, and the result is pushed back onto the operand stack. This process continues until the entire expression is read. Then, all remaining operators on the operator stack are evaluated with operands from the operand stack, pushing results back onto the operand stack. Ultimately, a single operand remains on the stack, representing the final result.

When an operator of equal precedence sits on top of the stack, it is evaluated before the new operator is pushed if it is left-associative; if it is right-associative, the new operator is simply pushed on top. Operators with the same precedence cannot have different associativity.

While the original algorithm handles parentheses, our implementation will address them recursively for code clarity.

Edsger Dijkstra coined the term “Shunting-yard” due to the algorithm’s resemblance to the operations of a railroad shunting yard.

However, the original algorithm lacks robust error checking, potentially parsing invalid expressions as valid ones. To address this, I’ve introduced a boolean variable to track whether an operator or operand is expected. When expecting operands, unary prefix operators can also be parsed. This ensures comprehensive expression syntax checking, preventing invalid expressions from slipping through.
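The algorithm, including the expectation flag, can be sketched as a tiny direct evaluator. This is illustrative only: it handles single-digit numbers and the left-associative operators + - * /, whereas Stork's parser builds an expression tree and handles far more operators, unary prefixes, and parentheses.

```cpp
#include <cctype>
#include <stack>
#include <stdexcept>
#include <string>

int precedence(char op) {
    return (op == '+' || op == '-') ? 1 : 2;
}

double apply(char op, double x, double y) {
    switch (op) {
        case '+': return x + y;
        case '-': return x - y;
        case '*': return x * y;
        default:  return x / y;
    }
}

double evaluate(const std::string& expr) {
    std::stack<double> operands;
    std::stack<char> operators;
    bool expecting_operand = true;  // the syntax-checking flag

    auto pop_and_apply = [&]() {
        char op = operators.top(); operators.pop();
        double y = operands.top(); operands.pop();
        double x = operands.top(); operands.pop();
        operands.push(apply(op, x, y));
    };

    for (char c : expr) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            if (!expecting_operand) {
                throw std::runtime_error("unexpected operand");
            }
            operands.push(c - '0');
            expecting_operand = false;
        } else {
            if (expecting_operand) {
                throw std::runtime_error("unexpected operator");
            }
            // All four operators are left-associative, so equal or higher
            // precedence on the stack is evaluated before pushing.
            while (!operators.empty() &&
                   precedence(operators.top()) >= precedence(c)) {
                pop_and_apply();
            }
            operators.push(c);
            expecting_operand = true;
        }
    }
    if (expecting_operand) {
        throw std::runtime_error("expression ends with an operator");
    }
    while (!operators.empty()) {
        pop_and_apply();
    }
    return operands.top();
}
```

Without the expecting_operand flag, malformed input such as "1+*2" would silently corrupt the stacks instead of being rejected up front.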

In Conclusion

Initially, I intended to cover both expression parsing and expression building in this part. However, the complexity exceeded the scope of a single post. This part focused on expression parsing, and I’ll elaborate on building expression objects in the next installment.

If memory serves, it took me a single weekend to create the first version of my first language some 15 years ago. This makes me wonder where things went astray this time.

Rather than acknowledging my advancing age and diminishing wit, I’ll attribute the delay to my demanding two-year-old son, who consumes much of my free time.

As in Part 1, feel free to examine, download, or even compile the code from my GitHub page.

Licensed under CC BY-NC-SA 4.0