Making a Lua Bytecode parser in Python


by CPunch

[The repository for this project can be found here]


So recently I've been getting back into Lua, my first scripting language. I've already done a series about manipulating the LuaVM (which you can read here), but this time I was interested in the LuaVM bytecode, specifically the Lua 5.1 bytecode. If you don't know what bytecode is or even how Lua works, here's a basic rundown:

  • LuaC is the Lua compiler. Its job is to turn our human-readable script into Lua bytecode ready to be executed by the LVM (LuaVM).
  • This bytecode is everything the LVM needs to run:
      • Constants
      • Locals
      • Protos (the functions)
      • and even debug information and line info


Now I know what you're thinking: "Who cares! Why would I need to know this?" Well, being able to parse bytecode lets us do many things! To name a few:

  • We could easily edit precompiled Lua Scripts embedded in a game. :eyes:
  • Read Lua Bytecode disassembly.
  • Help un-obfuscate scripts. :eyes:


Unfortunately, Lua bytecode has no real standard and changes from version to version, meaning our Lua 5.1 bytecode parser won't be able to read Lua 5.3 bytecode. Another unfortunate thing is that there is NO official documentation for Lua bytecode, since there is no real standard. Luckily, however, Kein-Hong Man, the really cool guy who made ChunkSpy, also wrote a super cool paper about the Lua 5.1 bytecode, "A No-Frills Introduction to Lua 5.1 VM Instructions"!! You can read his amazing paper here! He talks about some really important stuff like how the instructions are encoded and how the Lua chunk header is laid out.


To start off I'm going to make a basic lua script and use luac to compile it.


print("hello world")


I know, I know, I should be scripting for NASA. But the simplicity of this will help us tiptoe into the deep end of the Lua bytecode later on.


To compile this script, save it as "epic.lua" and call luac like so:


luac -o epic.luac epic.lua



You won't see much, but your script was just compiled into Lua bytecode! If you want, you can even try to read the compiled script.
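
If you don't have a hex editor handy, a few lines of Python can show you the raw bytes. This is just a minimal sketch and not part of the original project: `hexdump` is a made-up helper, and the sample bytes are the header I'd expect from a little-endian build with a 4-byte int and an 8-byte size_t (an assumption, your build may differ).

```python
def hexdump(data, width=16):
    """Return a hex + ASCII dump of a bytes object, one row per `width` bytes."""
    pad = width * 3
    rows = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in row)
        # show printable ASCII as-is, everything else as '.'
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        rows.append(f"{offset:08x}  {hex_part:<{pad}} {text_part}")
    return "\n".join(rows)


# Assumed sample: the 12 header bytes of a little-endian Lua 5.1 build.
sample = bytes([0x1B, 0x4C, 0x75, 0x61, 0x51, 0x00,
                0x01, 0x04, 0x08, 0x04, 0x08, 0x00])
print(hexdump(sample))

# To dump the real file instead:
# with open('epic.luac', 'rb') as f:
#     print(hexdump(f.read()))
```

Anything that isn't printable ASCII shows up as a dot in the right-hand column, which is why the strings stand out from the rest.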



Hmmm, a lot of weird symbols. The symbols before 'epic.luaA' are part of our chunk header; the ones after are our instructions. You can see that our 'hello world' and 'print' are readable. Lua stores these as constants, and they are human readable... for the most part.


Anyways, let's actually talk about parsing this bytecode. All LVM bytecode starts with a header. This just lets us know how to correctly parse the bytecode and what version it is. It's laid out as follows, in this order:

  • Signature: hex 0x1b followed by 'Lua' - 4 Bytes
  • Lua version (0x51 for Lua 5.1) - 1 Byte
  • Lua format - 1 Byte
  • Endian flag (1 = little endian, 0 = big endian) - 1 Byte
  • int size - 1 Byte
  • size_t size - 1 Byte
  • instruction size - 1 Byte
  • lua_Number size - 1 Byte
  • integral flag - 1 Byte


or you can just reference the paper lol:


Knowing this, here's some pseudocode for reading the header:


class LuaCompiler:
    def __init__(self):
        self.chunks = []
        self.chunk = {}
        self.index = 0

    def get_byte(self):
        b = self.bytecode[self.index]
        self.index = self.index + 1
        return b

    def get_string(self, size):
        s = "".join(chr(x) for x in self.bytecode[self.index:self.index+size])
        self.index = self.index + size
        return s

    def decode_bytecode(self, bytecode):
        self.bytecode   = bytecode

        self.signature_byte = self.get_byte()
        self.signature = self.get_string(3)
        self.vm_version = self.get_byte()
        self.bytecode_format = self.get_byte()
        self.big_endian = (self.get_byte() == 0)
        self.int_size   = self.get_byte()
        self.size_t     = self.get_byte()
        self.instr_size = self.get_byte() # gets size of instructions
        self.l_number_size = self.get_byte() # size of lua_Number
        self.integral_flag = self.get_byte()
        
        print("Lua VM version: ", hex(self.vm_version))
        print("Big Endian: ", self.big_endian)
        print("int_size: ", self.int_size)
        print("size_t: ", self.size_t)
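
Since the header is a fixed 12 bytes, it can also be read in one go with Python's struct module. Here's a minimal sketch of that idea, under a few assumptions: `parse_header` and `SAMPLE_HEADER` are names I made up for illustration, and the sample bytes describe a little-endian build with a 4-byte int and an 8-byte size_t, which may not match your machine.

```python
import struct

# Assumed sample header: little-endian build, 4-byte int, 8-byte size_t,
# 4-byte instructions, 8-byte floating-point lua_Number.
SAMPLE_HEADER = bytes([0x1B, 0x4C, 0x75, 0x61, 0x51, 0x00,
                       0x01, 0x04, 0x08, 0x04, 0x08, 0x00])


def parse_header(data):
    """Unpack the 12-byte Lua 5.1 chunk header into a dict."""
    if len(data) < 12:
        raise ValueError("the header is 12 bytes long")
    # '4s' = the 4-byte signature, '8B' = eight unsigned single bytes
    (signature, version, fmt, endian, int_size, size_t,
     instr_size, number_size, integral) = struct.unpack('<4s8B', data[:12])
    if signature != b'\x1bLua':
        raise ValueError("not a Lua bytecode file")
    return {
        'version': version,          # 0x51 for Lua 5.1
        'format': fmt,               # 0 is the official format
        'big_endian': endian == 0,   # the flag is 1 for little endian
        'int_size': int_size,
        'size_t': size_t,
        'instr_size': instr_size,
        'number_size': number_size,
        'integral': integral != 0,   # True if lua_Number is an integer type
    }


header = parse_header(SAMPLE_HEADER)
print(hex(header['version']), header['big_endian'],
      header['int_size'], header['size_t'])
```

Checking the signature up front is a cheap way to bail out early when someone hands the parser a plain .lua source file by mistake.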


Now we're going to have to talk about instructions. The Lua 5.1 VM has 38 different opcodes for 38 instructions. Each instruction has up to 3 operand fields, A, B, and C (they usually hold register or constant indexes). Not all instructions use all three, and there are 3 main types of instructions, each with a different way of encoding the fields.

  • iABC - This type of instruction uses all three fields, with each representing an unsigned integer.
  • iABx - This type of instruction uses A and Bx, both representing an unsigned integer as well.
  • iAsBx - This type of instruction uses A and sBx, where sBx can represent a negative number. The sBx in this instruction is strange, however. Instead of having 1 bit represent the sign, the range is -131071 to 131071. It's encoded as a regular unsigned integer, so to get the actual number, you subtract 131071.


All instructions start with the opcode [6 bits, the lowest ones], followed by the A field [8 bits]. The remaining 18 bits are encoded differently per type:

  • iABC - B and C are both 9 bits
  • iABx and iAsBx - Bx (or sBx) is 18 bits
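
To make the layout above concrete, here's a small sketch that pulls the fields out with bit shifts and masks. Note this is a different technique from the string-based `get_bits` used in this post, and `decode_fields` and `SBX_BIAS` are names I made up. The test instructions are assembled by hand, with opcode numbers taken from the Lua 5.1 sources.

```python
SBX_BIAS = 131071  # subtracted from the unsigned Bx value to get sBx


def decode_fields(instr):
    """Split a 32-bit Lua 5.1 instruction into every possible field."""
    bx = (instr >> 14) & 0x3FFFF      # 18 bits, used by iABx and iAsBx
    return {
        'OPCODE': instr & 0x3F,       # lowest 6 bits
        'A': (instr >> 6) & 0xFF,     # next 8 bits
        'C': (instr >> 14) & 0x1FF,   # 9 bits, iABC only
        'B': (instr >> 23) & 0x1FF,   # top 9 bits, iABC only
        'Bx': bx,
        'sBx': bx - SBX_BIAS,
    }


# Hand-assembled examples:
loadk = 1 | (1 << 6) | (1 << 14)       # LOADK  A=1 Bx=1
call = 28 | (1 << 14) | (2 << 23)      # CALL   A=0 B=2 C=1
jmp = 22 | ((SBX_BIAS - 1) << 14)      # JMP    sBx=-1

for word in (loadk, call, jmp):
    print(hex(word), decode_fields(word))
```

The function returns every field regardless of type, so the caller still has to pick B and C, Bx, or sBx based on the opcode's type.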


You can also reference this very helpful chart!


In Lua bytecode, datatypes are stored as follows:

  • Integer - Usually 4 bytes long (the int size from the header), integer.
  • Number (lua_Number) - all Lua numbers are stored as this, usually 8 bytes long, double.
  • String
      • First comes the size of the string, stored as a size_t (its width comes from the header). The size includes a trailing null byte
      • Characters
  • Boolean - 1 Byte, only (1 or 0)
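
Strings are the only fiddly one in that list, so here's a standalone sketch of reading one. It's a minimal example under assumptions: `read_string` is a made-up helper (the class further down does the same job with `get_string`), and the sample bytes assume a little-endian build with an 8-byte size_t.

```python
def read_string(data, offset, size_t=8, big_endian=False):
    """Read one length-prefixed string; return (string or None, new offset)."""
    order = 'big' if big_endian else 'little'
    size = int.from_bytes(data[offset:offset + size_t], byteorder=order)
    offset += size_t
    if size == 0:
        return None, offset            # a zero length means "no string"
    raw = data[offset:offset + size]
    offset += size
    # the stored length includes the trailing null byte, so drop it
    return raw[:-1].decode('latin-1'), offset


# Assumed sample: the constant "print" (length 6 = 5 characters + null byte).
sample = (6).to_bytes(8, 'little') + b'print\x00'
print(read_string(sample, 0))
```

Returning the new offset alongside the value keeps the helper stateless, which makes it easy to poke at in a REPL.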


Now, let's write some code to read the datatypes before parsing the function chunk.


We'll need to be able to get at the individual bits of each instruction. To do that I'll be using Python's 'bin' function, which turns any number into a base 2 string, aka BInary. However, we still need it to be 32 bits long, so for any missing bits I'll just add a leading 0.


That'll look like this:


# extract bits p through k (1-indexed from the least significant bit, both inclusive)
def get_bits(num, p, k):
    # convert the number into a binary string and remove the '0b' prefix
    binary = bin(num)[2:]

    # fill in missing bits with leading zeros so we always have 32 of them
    binary = binary.zfill(32)

    # bit 1 is the LAST character of the string, so count from the end
    end = len(binary) - p + 1
    start = len(binary) - k  # not "- k + 1", that would drop the highest bit

    # extract the sub-string holding bits p..k
    bit_sub_str = binary[start:end]

    # convert the extracted sub-string back into an integer
    return int(bit_sub_str, 2)


This method lets us parse the binary of an instruction, and extract the specific bits we want, then convert them back into base 10. Pretty cool :)


However, we still aren't done. We need to parse multiple bytes into a double, integer, etc. To do that, we'll start with what we wrote previously, but with a few more changes. I mainly used python's struct module to be able to parse these bytes.

Here's what that looks like:



import struct


class LuaCompiler:
    def __init__(self):
        self.chunks = []
        self.chunk = {}
        self.index = 0

    def get_byte(self):
        b = self.bytecode[self.index]
        self.index = self.index + 1
        return b

    def get_int32(self):
        # always reads exactly 4 bytes, so advance by 4 rather than int_size
        byteorder = 'big' if self.big_endian else 'little'
        i = int.from_bytes(self.bytecode[self.index:self.index+4], byteorder=byteorder, signed=False)
        self.index = self.index + 4
        return i

    def get_int(self):
        i = 0
        if (self.big_endian):
            i = int.from_bytes(self.bytecode[self.index:self.index+self.int_size], byteorder='big', signed=False)
        else:
            i = int.from_bytes(self.bytecode[self.index:self.index+self.int_size], byteorder='little', signed=False)
        self.index = self.index + self.int_size
        return i

    def get_size_t(self):
        s = ''
        if (self.big_endian):
            s = int.from_bytes(self.bytecode[self.index:self.index+self.size_t], byteorder='big', signed=False)
        else:
            s = int.from_bytes(self.bytecode[self.index:self.index+self.size_t], byteorder='little', signed=False)
        self.index = self.index + self.size_t
        return s

    def get_double(self):
        if self.big_endian:
            f = struct.unpack('>d', bytearray(self.bytecode[self.index:self.index+8]))
        else:
            f = struct.unpack('<d', bytearray(self.bytecode[self.index:self.index+8]))
        self.index = self.index + 8
        return f[0]

    def get_string(self, size):
        if size is None:  # no explicit size given: read the size_t length prefix first
            size = self.get_size_t()
            if (size == 0):
                return None
        
        s = "".join(chr(x) for x in self.bytecode[self.index:self.index+size])
        self.index = self.index + size
        return s

    def decode_bytecode(self, bytecode):
        self.bytecode   = bytecode

        self.signature_byte = self.get_byte()
        self.signature = self.get_string(3)
        self.vm_version = self.get_byte()
        self.bytecode_format = self.get_byte()
        self.big_endian = (self.get_byte() == 0)
        self.int_size   = self.get_byte()
        self.size_t     = self.get_byte()
        self.instr_size = self.get_byte() # gets size of instructions
        self.l_number_size = self.get_byte() # size of lua_Number
        self.integral_flag = self.get_byte()
        
        print("Lua VM version: ", hex(self.vm_version))
        print("Big Endian: ", self.big_endian)
        print("int_size: ", self.int_size)
        print("size_t: ", self.size_t)
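
If the endian flag still feels abstract, this tiny sketch shows what it changes. It only uses `int.from_bytes` and `struct`, the same calls as the class above. The byte values are made up for illustration.

```python
import struct

raw_int = bytes([0x01, 0x00, 0x00, 0x00])

# the same 4 bytes give very different numbers depending on byte order
as_little = int.from_bytes(raw_int, byteorder='little')  # least significant byte first
as_big = int.from_bytes(raw_int, byteorder='big')        # most significant byte first
print(as_little, as_big)

# a lua_Number round trip: pack 0.5 as a little-endian double, then read it back
raw_double = struct.pack('<d', 0.5)
value = struct.unpack('<d', raw_double)[0]
print(len(raw_double), value)
```

Reading a little-endian file with the big-endian setting won't crash, it just gives garbage sizes, which is why the header has to be parsed first.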


Alright now that we have that, we can decode our proto chunks.


After the header is the first function chunk. This includes:

  • Name - dynamic size
  • First line - Integer
  • Last line - Integer
  • Upvalues - 1 Byte
  • Arguments - 1 Byte
  • VArg - 1 Byte
  • Stack - 1 Byte
  • Instruction list
      • Number of instructions - Integer
      • Instruction list - Dynamic size
  • Constant list
      • Number of constants - Integer
      • List of constants - Dynamic size. Each constant is stored as:
          • first byte is type of constant - 1 Byte
          • 4 main types of constants:
              • Type == 0: Nil - No data, ignore
              • Type == 1: Boolean - 1 Byte (1 or 0)
              • Type == 3: Number (lua_Number) - 8 Bytes
              • Type == 4: String - Dynamic size, a size_t length followed by the characters.
  • Function prototypes
      • Number of protos - Integer
      • Function chunks - Dynamic, it's big lol
  • Source lines
      • Number of lines - Integer
      • Lines - Integer each, dynamic size
  • Local list
      • Number of locals - Integer
      • Each local is stored as:
          • name - String
          • start PC - Int
          • end PC - Int
  • Upvalue list
      • Number of upvalues - Integer
      • List of strings - strings, the names of the upvalues


Here's the code equivalent:


    def decode_chunk(self):
        chunk = {
            'INSTRUCTIONS': {},
            'CONSTANTS': {},
            'PROTOTYPES': {}
        }

        chunk['NAME'] = self.get_string(None)
        chunk['FIRST_LINE'] = self.get_int()
        chunk['LAST_LINE'] = self.get_int()

        chunk['UPVALUES'] = self.get_byte()
        chunk['ARGUMENTS'] = self.get_byte()
        chunk['VARG'] = self.get_byte()
        chunk['STACK'] = self.get_byte()

        # strip the leading '@' and the trailing null byte from the source name
        if chunk['NAME'] is not None:
            chunk['NAME'] = chunk['NAME'][1:-1]

        # parse instructions
        print("** DECODING INSTRUCTIONS")

        num = self.get_int()
        for i in range(num):
            instruction = {
                # opcode = opcode number;
                # type   = [ABC, ABx, AsBx]
                # A, B, C, Bx, or sBx depending on type
            }

            data   = self.get_int32()
            opcode = get_bits(data, 1, 6)
            tp   = lua_opcode_types[opcode]

            instruction['OPCODE'] = opcode
            instruction['TYPE'] = tp
            instruction['A'] = get_bits(data, 7, 14)

            if instruction['TYPE'] == "ABC":
                instruction['B'] = get_bits(data, 24, 32)
                instruction['C'] = get_bits(data, 15, 23)
            elif instruction['TYPE'] == "ABx":
                instruction['Bx'] = get_bits(data, 15, 32)
            elif instruction['TYPE'] == "AsBx":
                instruction['sBx'] = get_bits(data, 15, 32) - 131071

            chunk['INSTRUCTIONS'][i] = instruction

            print(lua_opcode_names[opcode], instruction)

        # get constants
        print("** DECODING CONSTANTS")

        num = self.get_int()
        for i in range(num):
            constant = {
                # type = constant type;
                # data = constant data;
            }
            constant['TYPE'] = self.get_byte()

            if constant['TYPE'] == 1:
                constant['DATA'] = (self.get_byte() != 0)
            elif constant['TYPE'] == 3:
                constant['DATA'] = self.get_double()
            elif constant['TYPE'] == 4:
                constant['DATA'] = self.get_string(None)[:-1]

            print(constant)

            chunk['CONSTANTS'][i] = constant

        # parse protos

        print("** DECODING PROTOS")

        num = self.get_int()
        for i in range(num):
            chunk['PROTOTYPES'][i] = self.decode_chunk()

        # debug stuff
        print("** DECODING DEBUG SYMBOLS")

        # line numbers
        num = self.get_int()
        for i in range(num):
            self.get_int32()

        # locals
        num = self.get_int()
        for i in range(num):
            print(self.get_string(None)[:-1]) # local name
            self.get_int32() # local start PC
            self.get_int32() # local end   PC

        # upvalues
        num = self.get_int()
        for i in range(num):
            self.get_string(None) # upvalue name

        self.chunks.append(chunk)

        return chunk 
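
One thing decode_chunk leans on is a pair of lookup tables, `lua_opcode_types` and `lua_opcode_names`, which the snippets in this post never define. Here's a sketch of what they could look like. The opcode order and types follow lopcodes.h from the Lua 5.1 sources, but the list-of-tuples layout and the `_LUA_OPCODES` name are my own assumption and may not match the repository.

```python
# (name, type) for each Lua 5.1 opcode, in opcode-number order
_LUA_OPCODES = [
    ("MOVE", "ABC"), ("LOADK", "ABx"), ("LOADBOOL", "ABC"), ("LOADNIL", "ABC"),
    ("GETUPVAL", "ABC"), ("GETGLOBAL", "ABx"), ("GETTABLE", "ABC"), ("SETGLOBAL", "ABx"),
    ("SETUPVAL", "ABC"), ("SETTABLE", "ABC"), ("NEWTABLE", "ABC"), ("SELF", "ABC"),
    ("ADD", "ABC"), ("SUB", "ABC"), ("MUL", "ABC"), ("DIV", "ABC"),
    ("MOD", "ABC"), ("POW", "ABC"), ("UNM", "ABC"), ("NOT", "ABC"),
    ("LEN", "ABC"), ("CONCAT", "ABC"), ("JMP", "AsBx"), ("EQ", "ABC"),
    ("LT", "ABC"), ("LE", "ABC"), ("TEST", "ABC"), ("TESTSET", "ABC"),
    ("CALL", "ABC"), ("TAILCALL", "ABC"), ("RETURN", "ABC"), ("FORLOOP", "AsBx"),
    ("FORPREP", "AsBx"), ("TFORLOOP", "ABC"), ("SETLIST", "ABC"), ("CLOSE", "ABC"),
    ("CLOSURE", "ABx"), ("VARARG", "ABC"),
]

# index both tables by opcode number, the way decode_chunk expects
lua_opcode_names = [name for name, _ in _LUA_OPCODES]
lua_opcode_types = [tp for _, tp in _LUA_OPCODES]

print(len(lua_opcode_names), lua_opcode_names[5], lua_opcode_types[5])
```

Keeping the name and type side by side in one list makes it harder for the two tables to drift out of sync.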


So, using this, let's go back to where we started. Let's try and parse our epic compiled Lua bytecode from the beginning.


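parser = LuaCompiler()

# read the whole file as raw bytes; indexing a bytes object gives ints 0-255
with open('epic.luac', 'rb') as luac_file:
    bytecode = luac_file.read()

parser.decode_bytecode(bytecode)  # header first
parser.decode_chunk()             # then the top-level function chunk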


Your output should look something like:



Congrats!!!! You just wasted a month of your life on a project!!! oh wait, I must be self-projecting again. Anyways, congrats! You have successfully parsed Lua 5.1 Bytecode! Thank you so much for reading this! More projects soon I promise!


Sep 15, 2019 by CPunch