Kyle Miller > Programming > Brainf*ck interpreter in x86-64 assembly

Brainf*ck interpreter in x86-64 assembly

Brainf*ck, if you didn’t already know, is a fairly popular Turing tarpit: it’s a language which is powerful enough to compute anything a general purpose computer can (that is, it’s Turing-complete), but it’s also difficult to use because of how few features it supports (that is, it’s a tarpit). The description comes from Alan Perlis in his Epigrams on Programming, “Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy.” ^[1] There’s more information about the language at http://www.muppetlabs.com/~breadbox/bf/, and even more about esoteric programming languages at http://esolangs.org/.

What I’m going to do is give some background on Brainf*ck and then proceed to show how one would implement a fairly fast interpreter for it in x86-64 assembly without doing anything fancy. This code was written to be compiled on Mac OS X, but with very minor modifications it can be compiled and run on 64-bit Linux as well. The final assembly source is bf.s, and instructions to assemble this are in the section Assembly time.

I did this because I wanted to see what it was like to implement something fairly simple in x86-64 assembly, and I didn’t feel like doing anything particularly useful one afternoon at the end of May 2012.

1. Brainf*ck specification

This information can be found elsewhere, but I give it for completeness.

Programs in Brainf*ck are written using only six characters, and the program is executed using a very simple machine model. The machine has a large array of numbers along with a pointer into this array, called the data pointer (the number to which the pointer is pointing is called the “current number”). The program is stored separately from this array of numbers, and programs have no access to program data. The machine executes each character one-at-a-time and sequentially by the following table:

`>`	Moves the data pointer right
`<`	Moves the data pointer left
`+`	Increments the current number
`-`	Decrements the current number
`.`	Outputs the current number, encoded in ASCII (so 65 emits an `A`)
`,`	Inputs a number and replaces the current number with it, again in ASCII encoding
`[`	If the current number is zero, then the normal execution flow is broken and the machine jumps forward to the character after the matching `]`
`]`	If the current number is not zero, then the normal execution flow is broken and the machine jumps backward to the character after the matching `]`

It is invalid for a Brainf*ck program to have mismatched [’s and ]’s. Also, if the control flow ever falls off the end of the program, then that’s considered to be an instruction to halt execution.

The most complicated characters to define are definitely [ and ], but it’s best to think of them together as the definition for loops. For instance, the code [->+<] is a loop which ends up setting the current number to zero while adding it to the number immediately to its right. This kind of construct is useful for constructing numbers; notice that the only ways to manipulate numbers are + and -, but no one wants to write n +’s just to load n into an array cell! Let’s say we want to print out A. What we can do, at the expense of using an extra memory cell, is write ++++++[->++++++++++<]>+++++., which multiplies six by ten, adds five to the result, and then prints it.

There are some details missing from this basic description of a Brainf*ck machine. What’s left is to decide

what kind of numbers are in the array,
what size is the array,
what to do in the event the data pointer goes off one of the ends of the array, and
where the data pointer initially points to.

It’s common to let the numbers be 8-bit integers with overflow (so 255 + 1 = 0, and 0 − 1 = 255), so this is what we’ll choose for our interpreter. It’s also common for the data array to have at least 30,000 cells, so we’ll allocate 4 megabytes, just to be safe. For item 3, it’s not clear what is the best way to handle the data pointer going off the ends of the array. One valid way to handle it is to just say it’s an error, or another way is to make the data pointer wrap around the address space (making it circular). We’ll do a third option, which is to make decrementing and incrementing the data pointer have no effect when it’s at the beginning and end of the data array, respectively, which lets a Brainf*ck program detect when the data pointer is at the bounds of the data space. Finally, to what should the data pointer initially be pointing? We’re just going to say the left-most array element. Why? Convenience. Or you could argue that this is the convention for Turing machines in Sipser’s book Theory of Computation.

One other consideration is what to do when there is no more data from the user because of “end-of-file” when trying to execute ,. In this case, we pretend that the user has input the null character, which is not ideal since the user might have actually intended to input the null character. It’s left as an exercise to the reader to modify the code to make it more reasonable (which might involve changing the size of the numbers to 16 bits).

2. To bytecode

How are we going to implement this machine on the x86? We would like to be able to support [ and ] in a reasonably optimal way, and we would also like to optimize the if/elseif chain at the core of the interpreter which executes each of the instructions. The way we’ll satisfy these desires is to create a bytecode which is a little closer to x86, and then compile Brainf*ck into this bytecode before execution.

A straight-forward approach would be to store each of the characters sequentially in memory, and then when the time arises to perform a jump, we count the [ and ] characters until we find the matching bracket. Unfortunately, this takes linear time in the length of the program, and it also doesn’t take advantage of the random memory access which the x86 supports. A slightly better method is to design a bytecode which has absolute jumps with hard-coded destinations for the brackets, which is what we’ll do. We can also choose encodings so it’s possible to implement the core of the interpreter as a branch table. The [->+<] example in this bytecode scheme could be encoded as

0x100001168: jz 0x10000117e
0x100001171: dec
0x100001172: incdp
0x100001173: inc
0x100001174: decdp
0x100001175: jnz 0x100001171

where the mnemonics are for the following bytecodes:

`incdp`	Encoded as `0x01`. Essentially `>`
`decdp`	Encoded as `0x02`. Essentially `<`
`inc`	Encoded as `0x03`. Essentially `+`
`dec`	Encoded as `0x04`. Essentially `-`
`print`	Encoded as `0x05`. Essentially `.`
`input`	Encoded as `0x06`. Essentially `,`
`jz addr`	Encoded as `0x07 addr`, where `addr` is an 8-byte address. Jumps to `addr` if the current number is zero, otherwise continues to the next bytecode
`jnz addr`	Encoded as `0x08 addr`, where `addr` is an 8-byte address. Jumps to `addr` if the current number is not zero, otherwise continues to the next bytecode

3. Assembly time

Now, there are two parts to building the interpreter: the compiler which makes bytecode, and the interpreter which executes the bytecode.

We’re going to use nasm for the assembler and gcc for the linker. The commands to compile the assembly to an executable called bf in Mac OS X are

nasm -g -f macho64 -Ox bf.s
gcc -arch x86_64 bf.o -o bf

where bf.s is bf.s, the assembly source file.

We give a few options to nasm: -g puts debug information into the object file so we can enter breakpoints by name in gdb when debugging, -f macho64 tells nasm to use 64-bit assembly and output a Mach-O object file which gcc will understand (for Linux, replace this with -f elf64), and -Ox tells nasm to optimize jumps, which isn’t necessary but doesn’t hurt.

The gcc for Mac OS X requires -arch x86_64 (at least on 10.5) because it would otherwise try to compile a 32-bit binary. This isn’t necessary in 64-bit Linux.

3.1. x86-64 ABI

As a quick overview, I’m just going to quickly go over how the application binary interface (ABI) for x86-64 works for basic function calls. The specification is the System V Application Binary Interface (AMD64 Architecture Processor Supplement). I’m only going to really mention what’s in Section 3.2 and the table on page 15.

An ABI is the definition for exactly how control is passed between different encapsulated pieces of code (functions), which are passed values called arguments when they are passed control and which might return a value when they return control.

Control is passed using call, and control is returned using ret. It is up to the called function to preserve the registers rbx, r12, r13, r14, r15, and rbp between when it gains and returns control, and all other general-purpose registers can be used freely (which is to say it is up to the calling function to save all other registers it might care about across the function call)

The first six arguments to a function are passed in registers rdi, rsi, rdx, rcx, r8, and the rest are pushed in reverse order on the stack. (In contrast, the 32-bit ABI passes everything on the stack; the 64-bit x86 has more registers, so we can get away with using registers to pass arguments, which means we don’t have to do unnecessary memory traffic). When a function returns a value, it passes it in rax.

When a function takes a variable number of arguments (like printf), then rax is set to the number of vector registers being passed to the function. We’re just going to set it to zero since we only use general-purpose registers.

One complication on Mac OS X is that the stack pointer rsp must be 16-byte aligned when calling a function, which I’ve read is so the stack doesn’t need to be realigned by the callee to use SSE instructions. Because of this, we have an sub rsp, 8 to correct the stack pointer in the prologue of our functions since the call instruction pushes an 8-byte return address. This has the effect of allocating 8-bytes of unused local storage on the stack, so it’ll work fine on Linux.

One other complication is that Mac OS X forces programs to use rip-relative addressing so that all code is position-independent. This mode of addressing lets you address relative to the instruction pointer instead of an absolute address. In nasm, this addressing is written as [rel label] where label is some address in the assembly code. This also works fine on Linux.

We only care about the ABI when we are trying to interoperate with libc. Otherwise, we’re going to use registers as we wish, but we’ll try to use the callee-saved registers listed above so we don’t need to save them across libc calls.

3.2. Boilerplate

We need a little bit of code to interface with the operating system. First, we need to tell nasm about the libc functions we’re going to use for this program, which gcc will then link into our executable.

extern _getchar, _putchar, _fopen, _fgetc, _fclose, _printf, _feof, _exit

On Linux, the underscores in front of these functions should be removed in the extern and elsewhere in the program.

The first function we have is the _main function (which should be called main on Linux), which takes a filename as the first argument, opens the file for reading, then calls compile and exec. If more than one argument is passed to bf, then we call dump_code instead of exec to debug the bytecode compiler.

section .data
usage_msg db "Usage: bf filename",10,0
fopen_read db "r",0

...

section .bss
file_pointer resq 1

...

section .text

global _main

_main:
	sub rsp, 8

	;;; if(argc < 2) print usage message
	cmp rdi, 2
	jge no_print_usage
	
	;;; printf("Usage: bf filename\n"); return
	lea rdi, [rel usage_msg]
	xor eax, eax
	call _printf
	xor eax, eax
	add rsp, 8
	ret
no_print_usage:	
	mov r13, rdi

	;;; file_pointer = fopen(argv[1], "r")
	mov rdi, [rsi + 8]
	lea rsi, [rel fopen_read]
	call _fopen
	mov [rel file_pointer], rax

	call compile

	;;; fclose(file_pointer)
	mov rdi, [rel file_pointer]
	call _fclose

        ;;; if(argc > 2) dump_code; return
	cmp r13, 2
	jle nodumpcode
	call dump_code
	xor eax, eax
	add rsp, 8
	ret
nodumpcode:
	call exec
	xor eax, eax
	add rsp, 8
	ret

Note that we save rdi (the first argument to _main, which is usally called argc) into r13, a callee-saved register. This lets us check whether we should call dump_code after rdi is obliterated by the call to _fopen.

We also do the trick of replacing mov rax, 0 with xor rax, rax, but we go one step farther and write xor eax, eax since it takes less space to encode this instruction, and storing anything into the register exx automatically clears the upper 32 bits of rxx.

3.3. The compiler

The compiler reads the file one character at a time, ignoring any character which isn’t one of the six which are part of the language (giving a rudimentary way of commenting programs), and prints an error if there are mismatched brackets. We use the stack to keep track of where we’ve compiled [’s into jz’s. When we run into a ] and compile it into a jnz, we pop an address from the stack into which we place the address after the jnz, and we place the address after this popped address after the jnz. This lets us use the stack for both matching the braces and computing the absolute jump addresses at the same time.

In this code, we save rsp into r12 so that we can check when we are underflowing the stack (and report a mismatched brace). We also use this to revert the stack pointer to it’s original state when there are too many open braces.

The other important register is rbx, which is the pointer into bytecode area (code_area) to which the next compiled bytecode is stored.

section .data
...
mismatched_msg db "Too many ]'s",10,0
mismatched2_msg db "Unmatched open [.",10,0
...

section .bss
file_pointer resq 1
code_area resb 8*1024*1024
data_area resb 4*1024*1024

section .text

...

compile:
	sub rsp, 8
	mov r12, rsp
	lea rbx, [rel code_area]
cloop:  mov rdi, [rel file_pointer]
	call _feof
	test rax, rax
	jnz cdone
	mov rdi, [rel file_pointer]
	call _fgetc
	cmp rax, '>'
	je c_incdp
	cmp rax, '<'
	je c_decdp
	cmp rax, '+'
	je c_inc
	cmp rax, '-'
	je c_dec
	cmp rax, '.'
	je c_print
	cmp rax, ','
	je c_input
	cmp rax, '['
	je c_oloop
	cmp rax, ']'
	je c_cloop
	jmp cloop
cdone:
	mov byte [rbx], 0

	cmp r12, rsp
	jne mismatched2	
	
	add rsp, 8
	ret

c_incdp:
	mov byte [rbx], 1
	inc rbx
	jmp cloop
c_decdp:
	mov byte [rbx], 2
	inc rbx
	jmp cloop
c_inc:
	mov byte [rbx], 3
	inc rbx
	jmp cloop
c_dec:
	mov byte [rbx], 4
	inc rbx
	jmp cloop
c_print:
	mov byte [rbx], 5
	inc rbx
	jmp cloop
c_input:
	mov byte [rbx], 6
	inc rbx
	jmp cloop
c_oloop:
	mov byte [rbx], 7
	push rbx
	add rbx, 9
	jmp cloop
c_cloop:
	mov byte [rbx], 8
	pop rax
	cmp rsp, r12
	jg mismatched
	mov r10, rax
	add r10, 9
	mov [rbx+1], r10
	mov r10, rbx
	add r10, 9
	mov [rax+1], r10
	add rbx, 9
	jmp cloop
mismatched:
	mov rsp, r12
	lea rdi, [rel mismatched_msg]
	xor eax, eax
	call _printf
	mov eax, 1
	call _exit
mismatched2:
	mov rsp, r12
	lea rdi, [rel mismatched2_msg]
	xor eax, eax
	call _printf
	mov eax, 1
        call _exit

For debugging purposes, there’s also a function called dump_code which dumps the bytecode into a human-readable format. The most interesting part of dump_code is that it uses s_ops, which is an array of string pointers, to translate bytecodes into a string, taking advantage of the fact that the bytecodes are 0x01–0x08.

A gotcha which I ran into while implementing dump_code was that modifying the lower byte of a register (by using the subregister dl of rdx, for instance) doesn’t clear any of the rest of rdx, despite the fact that modifying edx would modify the upper bits of rdx. I’d only ever used the 64-bit registers, so this came as a surprise for me. A solution is to just perform and rdx, 0xFF to clear the upper bits. You’ll find and rdx, 0xF in the code below, but that’s because of the observation that each of the bytecodes take at most 4 bits to encode.

section .data
...
dump_reg db "0x%llx: %s",10,0
dump_jump db "0x%llx: %s 0x%llx",10,0

s_incdp db "incdp",0
s_decdp db "decdp",0
s_inc db "inc",0
s_dec db "dec",0
s_print db "print",0
s_input db "input",0
s_jz db "jz",0
s_jnz db "jnz",0

s_ops dq 0, s_incdp, s_decdp, s_inc, s_dec
      dq s_print, s_input, s_jz, s_jnz

...

section .text

...

dump_code:
	sub rsp, 8
	lea rbx, [rel code_area]
	lea r15, [rel s_ops]
.loop:	cmp byte [rbx], 0
	je dump_code.done
	cmp byte [rbx], 7
	jge dump_code.jump
	mov dl, [rbx]
	and rdx, 0xF
	mov rdx, [r15 + 8*rdx]
	mov rsi, rbx
	lea rdi, [rel dump_reg]
	xor eax, eax
	call _printf
	inc rbx
	jmp .loop
.jump:	mov rcx, [rbx+1]
	mov dl, [rbx]
	and rdx, 0xF
	mov rdx, [r15 + 8*rdx]
	mov rsi, rbx
	lea rdi, [rel dump_jump]
	xor eax, eax
	call _printf
	add rbx, 9
	jmp .loop
.done:  add rsp, 8
	ret

3.4. Interpreter

The interpreter is a really tight loop in exec. It loads bytes one at a time into bl which are pointed to by the code pointer r12. Then, jmp [r13 + 8*rbx] runs the appropriate code for the loaded bytecode, where r13 is the address of e_table, the jump table. We take advantage of the fact that code_area is defined in the .bss section, so it is initialized to all zero, hence we know that reading a zero means we’ve run off the end of the program and should therefore halt. We do this check by having the zeroth entry of the jump table be e_ret, which returns from the interpreter. The rest of the code is fairly straightforward.

section .bss
code_area resb 8*1024*1024
data_area resb 4*1024*1024

section .text

...

exec:
	sub rsp, 8
	xor ebx, ebx
	lea rbp, [rel data_area] ;; rbp is data pointer
	mov r15, rbp  		 ;; r15 is data_area
        mov r14, r15
        add r14, 4*1024*1024     ;; r14 is right after the end of data_area
	lea r12, [rel code_area] ;; r12 is code pointer
	lea r13, [rel e_table]	 ;; r13 is e_table
eloop:	mov bl, [r12]
	jmp [r13 + 8*rbx]
edone:
	add rsp, 8
	ret

e_table dq e_ret, e_incdp, e_decdp, e_inc, e_dec
	dq e_print, e_input, e_jz, e_jnz

e_ret:	jmp edone

e_incdp:
	cmp rbp, r14
	je e_incdp.noinc
	inc rbp
.noinc:
	inc r12
	jmp eloop

e_decdp:
	cmp rbp, r15
	je e_decdp.nodec
	dec rbp
.nodec:
	inc r12
	jmp eloop

e_inc:
	inc byte [rbp]
	inc r12
	jmp eloop
e_dec:
	dec byte [rbp]
	inc r12
	jmp eloop

e_print:
	mov rdi, [rbp]
	call _putchar
	inc r12
	jmp eloop
e_input:
	lea rdi, [rel stdin]
	call _feof
	test rax, rax
	jnz e_input.eof
	call _getchar
	mov [rbp], al
	inc r12
	jmp eloop
.eof:	mov byte [rbp], 0
	inc r12
	jmp eloop
stdin dq 0

e_jz:
	cmp byte [rbp], 0
	jz e_jz.dojmp
	add r12, 9
	jmp eloop
.dojmp: mov r12, [r12+1]
	jmp eloop

e_jnz:
	cmp byte [rbp], 0
	jnz e_jnz.dojmp
	add r12, 9
	jmp eloop
.dojmp: mov r12, [r12+1]
	jmp eloop

4. Conclusion

There you have it, a Brainf*ck interpreter in x86-64 assembly.

Some subjective tests have shown that is fairly fast. For instance, the game Lost Kingdom (a text adventure game in 2.1MB of Brainf*ck) runs without any noticable delay between input lines, which I find very surprising because of all the computation that’s done to output text.

There are a number of possible extensions one could try if one wanted to soak up some more time:

Extend the bytecode so that sequences of + and - get compiled into a single add n instruction. This is sure to give a couple-hundred-percent speedup. Something similar goes for sequences of < and >.
Compile the bytecode into raw x86-64 assembly and do away with the almost all all of the jumps. As it is, there’s a jump on each instruction, which is probably terrible for the pipeline.
Make a full optimizing compiler for Brainf*ck. Recognize common idioms and turn them into corresponding optimized code for the x86! An approximation is to convert the code into C and let gcc handle optimization, but I doubt alias analysis is anywhere good enough to generate optimal code.

^[1] http://www-pu.informatik.uni-tuebingen.de/users/klaeren/epigrams.html, which is from SIGPLAN Notices Vol. 17, No. 9, September 1982, pages 7–13.