Making string concatenation in GavelScript FAST

217 views foss open-source c++ gavelscript

by CPunch

I've been working on my single-header scripting language, GavelScript, for months now, even doing a complete rewrite! I've become very proud of the progress so far (even achieving speeds comparable to other scripting languages!) Today I wanted to explain in good old Open Source fashion how YOU can add custom operators and even instructions in the GavelScript VM!

Let's first talk about what exactly we'll be adding. String concatenation! String concatenation is the act of adding two or more strings together, and in current GavelScript looks like:

[ This is GS's built-in disassembler. ]

Now you may be wondering, "Why does this matter? It's not like you can change the way OP_ADD works, it would break the whole language!" And you would be right, however we can always add new instructions...

Let's take a step back and look at why exactly OP_ADD is bad for concatenating strings. Here's the actual instruction in the VM of OP_ADD.

First thing the instruction does is grab the arguments being added from the stack. Then if either of the args are strings or characters, concatenate the two args and push the result! (Then maybe trigger a Garbage Collection but thats out of scope for this post.) First of all, adding strings like that takes a lot of effort to do, we'll get back to that later. Secondly, because Gavel::addString is written to use a technique called "string interning" (which means that strings are reused and it makes string comparison much faster) its very costly to add new strings (which we do every OP_ADD instruction with strings.)

Now my solution is to move string concatenation out of OP_ADD into a new instruction called OP_CONCAT. OP_CONCAT will concatenate multiple strings at the same time, and then add the final string to the VM for string interning and garbage collection, saving a lot of wasted cycles. Lets first add it to the OPCODE enum.

I also added the instruction type to our GInstructionTypes lookup table & getOpCodeName(OPCODE op) so our disassembler knows how to treat the instruction! Now lets add the instruction to our VM.

Let's break this down, first thing I do is get the Ax argument from the instruction, this tells us how many values on the stack need to be converted to strings and concatenated. After that I setup some temporary values like the std::vector so I don't have to convert the values to strings twice. I compute the size of the end string by grabbing the GValues from the stack, converting them to strings putting those strings in the vector and adding the length to the total size. Then I use the total size to allocate a character array with room for a null terminator, and memcpy all of the strings to the buffer and place the null terminator at the very end. Finally I pop all of the values from the stack && push the finished string to the stack.

While this isn't the prettiest code in the world, it is fairly fast and works well. If you're familiar with the Lua VM this might look vaguely similar to luaV_concat (which is what i based this on). Anyways let's go ahead and add support for this instruction in our parser. If you didn't know GavelScript uses a Pratt-based, which is a "recursive decent parser" which means that supporting this is fairly easy.

First thing I did was add a new enum to GTokenType for our new DOT_DOT token & a corresponding precedent level enum. I also added it to the parse lookup table. PARSEFIX_CONCAT is an infix operator, which means that a token with a higher precedence level was already processed before (in our case, the first GValue to concatenate the rest too.)

Then in scanNextToken() I changed the handling of '.' to check for another '.' after, and if it did it's a DOT_DOT token!

Then I added the handler for the DOT_DOT token in runParseFix(Token token, ParseFix rule, bool canAssign)! In this, I parse the next expression that has a precedence level of PREC_TERM until there's no more DOT_DOT tokens. Parsing those expressions puts them on the stack so then all I need to do is emit the instruction and let the parser know that the OP_CONCAT instruction popped the values off the stack and the result of the concatenation is left on the stack! As a final touch, i'll remove the string concatenation feature of OP_ADD. This is mostly just to really force people to use the efficient replacement I made. Sue me. [view the whole commit here]

Now lets look at the chunks disassembly with out new OP_CONCAT instruction.

Look! Only 1 instruction to concatenate all of those strings! This is also a little helper for our VM, executing less instructions to do more work :). Now what would be the whole point if I didn't show some real world improvements. So I wrote two scripts, one that makes a million strings using the old OP_ADD method, and the other that uses the new OP_CONCAT method.

// using OP_ADD (NOTE: this no longer works in the current VM!)
for (var i = 0; i < 1000000; i++) do
    print(i + '.' + " strings!")

// using OP_CONCAT
for (var i = 0; i < 1000000; i++) do
    print(i .. '.' .. " strings!")

Here are the results with our old OP_ADD:

and with the new OP_CONCAT:

That's almost 3 whole seconds of shaved off time! I'm really glad to be able to share this, it's an amazingly rewarding passion to see huge amounts of progress overtime. I definitely never in my life thought I could say "I made a scripting language." If you'd like to learn more about the internals of scripting languages there's this amazingly good read all about how they're built by Robert Nystrom. Check it out!

Jul 15, 2020 by CPunch