The Approach
Whenever it comes to writing any kind of DSL, I always remember on my college compilers course, and then start writing a Lexer and a Parser.
The Lexer
A lexer (or Lexical Analyzer) takes the source text, and produces a stream of tokens. I often want to write this in a streaming manner, but since I wanted this to be both fast, and the code tiny, and since math formulas are usually relatively short, I opted to just have the lexer dump all the tokens into an Array.
ASCIIMath's syntax is simple:
- Whitespace is largely ignored.
- String constants are
"followed by any number of non-double-quote characters, and then a final". - Numeric constants are digits, with an optional
.character. All of:123,1.23,.123and123.are valid. - There are hundreds of keywords available, and they make up the bulk of the language. The language will attempt to find keywords at the current position in the source text, and if multiple candidates are found, the longest one is selected.
- If none of the above are found, the current character is interpreted as a single-character math identifier, and we advance to the next character.
This sometimes gives surprising results. (eg: foo results in
f (a single-character identifier) plus oo is a keyword for infinity.)The keyword selection logic is the most performance-critical bit: every character needs to look into the giant symbol table, and try to see if there are any keywords starting on this character. I optimized this in two ways:
- Instead of keeping the full symbol table in a large map / array, I key the full list by the first character in that symbol. This cuts down the number of symbols that must be considered significantly.
- We only care about the longest symbol, so the symbols underneath that first-character map are all stored in an array that is sorted by string length. That way, the first match encountered will always be the longest, and we can stop early.
Honestly, (1) alone is enough to speed us up, way faster than the main ASCIIMath library, but (2) added a nice 50% improvement on top.
I decided to tweak the language, just a tiny bit:
A pet peeve of mine is when a language's string syntax doesn't account for a way to encode the quote character. ASCIIMath doesn't. I extended the syntax to allow internal
""chars for a double-quote characters. (It was the option with the lowest lift.)The original library will advance through the source text, index by index through the string. However, since JS strings are all UTF-16, this has an broken behavior with unicode characters outside the Basic Multilingual Plane, like emojis. My library made sure to use the JS
const [first] = stringsyntax to reliably capture the first character in a way that respected unicode, so now you can use emojis in formulas!No, I don't expect this to be used responsibly. 😛
The original language had gaps: A few arrows were omitted, only some of the greek letters were defined, and so on. I added extra keywords for all greek letters, plus all the greek letter variants present in TeX's math symbols, plus a few other keywords to fill in the gaps. Plus, I fixed a number of the keywords that I found in the original library's code that didn't work. (They erroneously included
\characters, possibly from a bad copy-and-paste, and would never be detected by the main library's parser.)
For the token object itself, I chose a kind-of unique approach towards token types. Typically, people make the token type an enum and move on. However, I noticed that sometimes, tokens have many different types, like they are simultaneously operators AND stretchy AND must be visible. So I made my typ property a bit-field, and if you are wondering "is this token stretchy?" you'd do a tok.typ & STRETCHY_MASK. It ended up working out pretty well, especially with regards to file size - esbuild can convert all those bitmasks into nice, succinct constants.
The Parser
Next up, the parser - we need to convert that list of tokens into a full Abstract Syntax Tree. (AST.) Thankfully, ASCIIMath provides an EBNF grammar for the language:
v ::= [A-Za-z] | greek letters | numbers | other constant symbols
u ::= sqrt | text | bb | other unary symbols for font commands
b ::= frac | root | stackrel | other binary symbols
l ::= ( | [ | { | (: | {: | other left brackets
r ::= ) | ] | } | :) | :} | other right brackets
S ::= v | lEr | uS | bSS Simple expression
I ::= S_S | S^S | S_S^S | S Intermediate expression
E ::= IE | I/I ExpressionThis is almost correct, but does miss a few subtle details about the language:
- With parenthesized groups (
lEr) the right brace is optional. - Also, there is a special class of brace that is symmetrical, where the same brace character opens and closes the group. (ie:
|) These have additional requirements. - The params for unary and binary (
uSandbSS) expressions are optional. - If a closing brace is encountered in a leaf (
v) context, that's okay, but ONLY if it's not an argument for a unary or binary, or a subscript / superscript. - There are a few unary / binary commands that cause all syntax besides parenthesis rules to go out the window, and everything within the braces is used as text.
- The
Igroup implies that in divisions, the denominator cannot be another division. But they can, ASCIIMath accepts them, and so do I.
I simplified the grammar into this version:
expr := term*;
term := simp ('_' simp)? ('^' simp)? ('/' term)?;
simp := paren | unary | binary | leaf;
paren := ('(' | '[' | ...) expr (')' | ']' | ...);
unary := ('sqrt' | 'floor' | ...) simp;
binary := ('root' | 'frac' | ...) simp simp;
leaf := (str literal) | (num literal) | (symbol) | (char);To accept this grammar, I just created a basic recursive-descent parser, where each part of the grammar calls a function to accept the next part. I chose this approach for a few reasons:
- Easy.
- Tiny - since parsing is done through a bunch of top-level functions, esbuild's minimizer can do wonders bringing the size of the parser down.
I did have to code in a few special-cases:
- Expressions need to know if they're the top-level expression or not. Top-level expressions will accept errant end-braces as leaf nodes. Other expressions cannot. These expressions ALSO need to know if the end-brace that they're looking for is
- For symmetrical braces, we need to make sure that the other brace both exists, and is the same brace. Else, this is a literal
|token. (This behavior is required for augmented matrixes to work...) The tricky part about this is: We need to know up-front if this is aparennode up front, but if the current token is symmetrical, we'll only know that AFTER recursing into the subsequent expression, so we know where the closing brace is... I supported this by adding the ability to revert changes to the token stack - if we can't find the correct end token after recursing, we'll just revert all tokens that have been shifted off the original list, and return null for "nah, no parenthesis here." I wish there was a better way, but I couldn't think of one... - Also, the function that parses leafs will also get notified if it's in a context that accepts end-braces as tokens. Else, it returns null for "no leafs here", and will then depend on something else RE-requesting the leaf parser with the ability to parse that broken end-brace.
Rendering
This bit ended up being more complex than I expected, taking up like 85-90% of dev time, but there were just SO many special rules.
The code recurses through the AST, and assembles the final MathML string based on the tokens encountered, and their properties. What's worse - this isn't even interesting to write about. It's just "does this token have a subscript? Ok, use <msub>" or "are we adding a hat to the token? Then use <mover> plus an additional <mo> containing the '^' character." On and on and on. Mostly just a big ol chore.
A few interesting points, however:
Some unary / binary expressions want the text behind the their arguments, and not just the AST. To support that, I pass a "context" object containing the original source through all the render methods, and when we get to one of those params, we use the tokens in the argument to determine the indexes that we'll need to slice from the source code. Hacky, but it works.
ASCIIMath supports a wide array of "font" unaries. These will cause all the stuff in their arguments to go through a text transformer, converting characters into the Unicode Mathematical Alphanumeric Symbols block. This gives us neato text styles like
or . To support this, I wrote a function that does that transformation, and had the context object apply that transform whenever text is rendered. One complicating factor: The unicode blocks with all those styled letters are not all contiguous - some letters are missing. So you can't just map Unicode code-points and be done with it. No, you will HAVE to maintain lists of overrides and exceptions. Yay!The EBNF also doesn't account for matrixes - they simply reuse the parentheses' syntax, and have it be a special-case of nested parenthesized groups. So I built that detection logic into the renderer, just to keep the syntax as stock as I could. I'm not a huge fan of that pattern, however, and I might refactor the matrix-detection logic back into the parser at some point...
For augmented matrices, the upstream library uses the
columnlinesattribute in the<mtable>to draw the vertical line in augmented matrixes. This approach works fine in Firefox, since their MathML renderer is very featureful and stable. However, this attribute gets completely ignored in Chrome-based browsers. That's pretty bad, since the resulting matrix without the vertical line looks like a standard matrix, which is not what the user was intending. To work around this, MathFmt will simply use an extra-tall unicode bar character in a column. This looks alright (not great) but at least clearly indicates an augmented matrix in both Firefox and Chrome.
