Solarian Programmer

My programming ramblings

Writing a minimal x86-64 JIT compiler in C++ - Part 2

Posted on January 12, 2018 by Paul

In my last article, I showed you how to generate the machine code for a function at runtime, copy this code to a region of memory marked as executable, and call it from C++. Now we’ll go the other way around: we’ll call a C++ function from a function generated at runtime. As before, I assume that you are trying the code on Linux or macOS.

If you remember from part 1, we started by adding machine code instructions to a std::vector and then copying this code to an executable memory page. While this was a fine approach from a didactic point of view, in practice you will probably want to write the code directly to the executable memory. Here is an example of how I propose to do it:

    MemoryPages mp;
    mp.push(0x48); mp.push(0xb8);

The object mp from the above piece of code will ask the OS for memory, release this memory when it is no longer needed, and provide helper member functions that let us push pieces of machine code to the executable memory. We can also add safety features, e.g. a check that there is still room in the allocated memory pages before more data is pushed.

For simplicity, I will keep the entire code of this example in a single source file. We can split it into several files later if it grows too big.

Let’s start by writing the code for MemoryPages:

#include <iostream>
#include <string>
#include <vector>
#include <stdexcept>

#include <cstddef>      // size_t
#include <cstdint>      // uint8_t
#include <cstdio>       // printf, used later by test()
#include <cstring>
#include <unistd.h>
#include <sys/mman.h>

struct MemoryPages {
    uint8_t *mem;                   // Pointer to the start of the executable memory
    size_t page_size;               // OS defined memory page size (typically 4096 bytes)
    size_t pages = 0;               // Number of memory pages requested from the OS
    size_t position = 0;            // Offset of the first unused byte

// ...
};

In the above, position is the offset of the first unused byte of the memory area. It grows as we push more machine code to the executable memory.

Next, we basically copy the code that asks for executable memory from the previous tutorial to the struct constructor and the code that releases this memory to the destructor:

// ....

struct MemoryPages {
    // ...
    MemoryPages(size_t pages_requested = 1) {
        page_size = sysconf(_SC_PAGE_SIZE); // Get the machine page size
        mem = (uint8_t*) mmap(NULL, page_size * pages_requested, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if(mem == MAP_FAILED) {
            throw std::runtime_error("Can't allocate enough executable memory!");
        }
        pages = pages_requested;
    }

    ~MemoryPages() {
        munmap(mem, pages * page_size);
    }
    // ...
};

Please note that by default the constructor allocates a single page of memory; pass the required number of pages if you need more. If you want to use this example in production, I suggest implementing a mechanism that asks for more memory pages only when needed; see the mremap documentation for hints (mremap is Linux-specific, so macOS needs a different approach). For our purposes, one page of memory is more than enough and keeps the code simpler.
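As a rough illustration of growing a mapping on demand, here is a minimal sketch. The helper name grow_mapping and its parameters are hypothetical and not part of MemoryPages. It assumes an anonymous private mapping; on Linux it uses mremap with MREMAP_MAYMOVE, and elsewhere it falls back to allocating a new mapping, copying the old content and releasing the old mapping.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>

// Hypothetical helper (not part of the article's MemoryPages): resize an
// anonymous private mapping from old_size to new_size bytes.
// Returns the (possibly moved) start address, or nullptr on failure.
uint8_t *grow_mapping(uint8_t *mem, std::size_t old_size, std::size_t new_size, int prot) {
#if defined(__linux__)
    (void)prot; // mremap keeps the protection flags of the existing mapping
    void *p = mremap(mem, old_size, new_size, MREMAP_MAYMOVE);
    return p == MAP_FAILED ? nullptr : static_cast<uint8_t *>(p);
#else
    // Portable fallback: map a new region, copy the old content, unmap the old region
    void *p = mmap(nullptr, new_size, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return nullptr;
    }
    std::memcpy(p, mem, old_size < new_size ? old_size : new_size);
    munmap(mem, old_size);
    return static_cast<uint8_t *>(p);
#endif
}
```

Because the mapping may move, any pointer into the old region, including a function pointer to already generated code, must be recomputed from the returned address.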

Implementing a member function that pushes a byte of data to the memory is straightforward:

// ....

struct MemoryPages {
    // ...

    // Push a uint8_t number to the memory
    void push(uint8_t data) {
        check_available_space(sizeof data);
        mem[position] = data;
        position++;
    }
    // ...
};

Last time, for didactic purposes, we used an explicit approach to push numbers larger than one byte to the memory: we manually extracted the bytes from the input number and added them one by one in reverse order, as required by the little-endian byte order of Intel processors. It is more efficient and less error-prone to use a function like std::memcpy, which copies the bytes of a larger number in the byte order of the machine on which the code runs. We’ll use memcpy to copy the address stored in a function pointer to the memory in the next push function:

// ....

struct MemoryPages {
    // ...

    // Push a function pointer to the memory
    void push(void (*fn)()) {
        // uintptr_t is the integer type meant to hold a pointer value
        uintptr_t fn_address = reinterpret_cast<uintptr_t>(fn);
        check_available_space(sizeof fn_address);

        std::memcpy((mem + position), &fn_address, sizeof fn_address);
        position += sizeof fn_address;
    }
    // ...
};
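To make the equivalence between the two approaches concrete, here is a small sketch. The function names bytes_manual_le and bytes_memcpy are made up for this illustration; the first extracts the bytes by hand, least significant byte first, and the second lets std::memcpy copy them in host byte order. On a little-endian machine such as x86-64 both should return the same eight bytes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Manual approach: emit the least significant byte first (little-endian order)
std::array<uint8_t, 8> bytes_manual_le(uint64_t value) {
    std::array<uint8_t, 8> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<uint8_t>(value >> (8 * i));
    }
    return out;
}

// memcpy approach: copy the bytes in whatever order the host stores them
std::array<uint8_t, 8> bytes_memcpy(uint64_t value) {
    std::array<uint8_t, 8> out{};
    std::memcpy(out.data(), &value, sizeof value);
    return out;
}
```

For the value 0x10df49f60, the manual version yields 0x60 0x9f 0xf4 0x0d 0x01 0x00 0x00 0x00 on any machine, and the memcpy version yields the same bytes on a little-endian host.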

In some cases, the machine code for an instruction is just a fixed sequence of bytes, e.g.:

    mov rbp, rsp

is translated to machine code as:

    0x48, 0x89, 0xe5

For such cases, it is useful to have another push function that receives a std::vector of uint8_t numbers as input:

// ....

struct MemoryPages {
    // ...

    // Push a vector of uint8_t numbers to the memory
    void push(const std::vector<uint8_t> &data) {
        check_available_space(data.size());

        // data.data() is valid even for an empty vector, unlike &data[0]
        std::memcpy((mem + position), data.data(), data.size());
        position += data.size();
    }
    // ...
};

The code that checks if we can copy some data to the memory:

// ....

struct MemoryPages {
    // ...
    // Check if there is enough available space to push some data to the memory
    void check_available_space(size_t data_size) {
        if(position + data_size > pages * page_size) {
            throw std::runtime_error("Not enough virtual memory allocated!");
        }
    }
    // ...
};
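One detail worth noting is that the sum position + data_size could, in principle, wrap around for a very large data_size. A possible variant, sketched here as a free function with hypothetical names (check_space, capacity), compares against the remaining space instead, so no addition is needed:

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical free-function variant of check_available_space.
// position: number of bytes already used; capacity: total bytes allocated.
void check_space(std::size_t position, std::size_t capacity, std::size_t data_size) {
    // When position <= capacity, capacity - position cannot wrap around,
    // whereas position + data_size could for a huge data_size.
    if (position > capacity || data_size > capacity - position) {
        throw std::runtime_error("Not enough virtual memory allocated!");
    }
}
```

With a capacity of 4096 bytes, pushing 1 byte at position 4095 passes, while pushing 2 bytes at the same position throws.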

As suggested earlier, if you intend to use this code in production, it is a good idea to use mremap to ask for more memory pages when necessary, and to throw an error only if the OS can’t satisfy the request.

Finally, we could add a helper function to print the content of the occupied memory:

// ....

struct MemoryPages {
    // ...
    // Print the content of used memory
    void show_memory() {
        std::cout << "\nMemory content: " << position << "/" << pages * page_size << " bytes used\n";
        std::cout << std::hex;
        for(size_t i = 0; i < position; ++i) {
            std::cout << "0x" << (int) mem[i] << " ";
            if(i % 16 == 0 && i > 0) {
                std::cout << '\n';
            }
        }
        std::cout << std::dec;
        std::cout << "\n\n";
    }
};
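If you prefer a dump that is easier to compare with disassembler output, one possible variant is sketched below. The name hex_dump is hypothetical; it returns a string instead of printing, pads every byte to two hex digits and starts a new line after every 16th byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format `size` bytes starting at `data` as
// two-digit hex values, 16 values per line.
std::string hex_dump(const uint8_t *data, std::size_t size) {
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < size; ++i) {
        out << "0x" << std::setw(2) << static_cast<int>(data[i]);
        // Separate values with a space, or a newline after every 16th value
        if (i + 1 < size) {
            out << ((i + 1) % 16 == 0 ? '\n' : ' ');
        }
    }
    return out.str();
}
```

Called as hex_dump(mp.mem, mp.position), it would format the bytes pushed so far; for the four prologue bytes it returns "0x55 0x48 0x89 0xe5".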

Now that we’ve finished abstracting the main ideas from the previous article, we can get to the juicy bits of the current one: calling an existing C++ function from our generated machine code at runtime. Let’s simplify the problem a bit and investigate how we can call a C++ function that receives no arguments and returns nothing.

OK, so let’s write a C++ function that prints a message and (bear with me) modifies a global variable. I know that using globals is usually bad practice, but it will allow me to illustrate that calling a C++ function from our code generated at runtime can have side effects. It will also be useful for the next article in this series, which will implement a mini Forth interpreter that can JIT compile user-defined functions.

// ....

struct MemoryPages {
    // ...
};

// Global vector that is modified by test()
std::vector<int> a{1, 2, 3};

// Function to be called from our generated machine code
void test() {
    printf("Ohhh, boy ...\n");
    for(auto &e : a) {
        e -= 5;
    }
}

// ....

The idea is to call the function test() from a function generated at runtime, say func. This is how our func function could look in Assembly:

func():
    push rbp
    mov rbp, rsp

    call test()

    pop rbp
    ret

A direct call encodes its target as a 32-bit offset relative to the call instruction, which may not reach from our mmap-ed page to test(). We can sidestep this by loading the absolute address of the function into a register and calling through that register, i.e. by replacing the call test() line from above with:

func():
    push rbp
    mov rbp, rsp

    movabs rax, 0x0		# replace with the address of the called function
    call rax

    pop rbp
    ret

The body of the above function can look like this in machine code:

   0:   55                      push   rbp
   1:   48 89 e5                mov    rbp,rsp

   4:   48 b8 00 00 00 00 00    movabs rax,0x0
   b:   00 00 00
   e:   ff d0                   call   rax

  10:   5d                      pop    rbp
  11:   c3                      ret
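The two middle instructions of this listing can be generated by a small helper. The sketch below uses a plain std::vector as the destination and a hypothetical function name, emit_call_absolute; it only builds the bytes and does not execute anything.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: append "movabs rax, target" followed by "call rax"
void emit_call_absolute(std::vector<uint8_t> &code, uint64_t target) {
    // 0x48 0xb8: movabs rax, imm64 (the 8 immediate bytes follow)
    code.push_back(0x48);
    code.push_back(0xb8);

    // Copy the address in host byte order (little-endian on x86-64)
    uint8_t imm[sizeof target];
    std::memcpy(imm, &target, sizeof target);
    code.insert(code.end(), imm, imm + sizeof target);

    // 0xff 0xd0: call rax
    code.push_back(0xff);
    code.push_back(0xd0);
}
```

With a target of 0 this appends the same 12 bytes that the listing shows at offsets 0x4 to 0xf.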

In the above code, the first two instructions of the function body are called the prologue and the last two the epilogue. By convention, they are repeated in all new functions. It makes sense to put these two chunks of machine code in two separate variables and use them when we need them:

// ....

struct MemoryPages {
    // ...
};

namespace AssemblyChunks {
    std::vector<uint8_t> function_prologue {
        0x55,               // push rbp
        0x48, 0x89, 0xe5,   // mov rbp, rsp
    };

    std::vector<uint8_t> function_epilogue {
        0x5d,   // pop rbp
        0xc3    // ret
    };
}

// Global vector that is modified by test()
std::vector<int> a{1, 2, 3};

// Function to be called from our generated machine code
void test() {
    // ....
}

// ....

Next, we can write the main program. First, we create an instance of MemoryPages and we push the required machine code:

// ...

int main() {
    // Instance of exec mem structure
    MemoryPages mp;

    // Push prologue
    mp.push(AssemblyChunks::function_prologue);

    // Push the call to the C++ function test (actually we push the address of the test function)
    mp.push(0x48); mp.push(0xb8); mp.push(test);    // movabs rax, <function_address>
    mp.push(0xff); mp.push(0xd0);                   // call rax

    // Push epilogue and print the generated code
    mp.push(AssemblyChunks::function_epilogue);
    mp.show_memory();
}

If we run the above code, we should see the generated machine code. Here is what I see on a macOS machine:

Memory content: 18/4096 bytes used
0x55 0x48 0x89 0xe5 0x48 0xb8 0x60 0x9f 0xf4 0xd 0x1 0x0 0x0 0x0 0xff 0xd0 0x5d
0xc3

At this point, all we have to do is to cast the address of our generated code to a function pointer and call the function. We’ll also show the side effects of calling test() on the global variable a:

// ...

int main() {
    // ...

    std::cout << "Global data initial values:\n";
    std::cout << a[0] << "\t" << a[1] << "\t" << a[2] << "\n";

    // Cast the address of our generated code to a function pointer and call the function
    void (*func)() = reinterpret_cast<void (*)()>(mp.mem);
    func();

    std::cout << "Global data after test() was called from the generated code:\n";
    std::cout << a[0] << "\t" << a[1] << "\t" << a[2] << "\n";
}
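Some systems are reluctant to hand out memory that is writable and executable at the same time. As a hedged alternative to the approach used in this article, the sketch below (with a hypothetical helper name, run_generated) writes the code into a read/write mapping first and only then switches the mapping to read/execute with mprotect before calling it. It assumes the bytes in code form a complete function that takes no arguments and returns nothing.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include <sys/mman.h>

// Hypothetical helper: copy `code` into fresh memory, switch that memory from
// writable to executable, call it as void(), then release it.
// Returns false if the code is empty or the OS refuses one of the requests.
bool run_generated(const std::vector<uint8_t> &code) {
    if (code.empty()) {
        return false;
    }
    void *mem = mmap(nullptr, code.size(), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return false;
    }
    std::memcpy(mem, code.data(), code.size());

    // From here on the pages are executable but no longer writable
    if (mprotect(mem, code.size(), PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, code.size());
        return false;
    }

    reinterpret_cast<void (*)()>(mem)();
    munmap(mem, code.size());
    return true;
}
```

The trade-off is that the code can no longer be patched in place once it is executable, so this fits best when a function is generated completely before its first call.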

This is what I see on a macOS machine:

$ clang++ -std=c++14 -stdlib=libc++ -Wall -pedantic funcall.cpp -o funcall
$ ./funcall

Memory content: 18/4096 bytes used
0x55 0x48 0x89 0xe5 0x48 0xb8 0x40 0x9f 0x52 0x3 0x1 0x0 0x0 0x0 0xff 0xd0 0x5d
0xc3

Global data initial values:
1	2	3
Ohhh, boy ...
Global data after test() was called from the generated code:
-4	-3	-2
$

If you run the code on your machine, you should get identical results, except for the machine code part that stores the address of the C++ function.
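To see which address ended up in the generated code, you can read the eight immediate bytes of the movabs instruction back into an integer. This is a minimal sketch with a hypothetical helper name, embedded_address; it assumes the layout used in this article, where the immediate starts at offset 6 (four prologue bytes plus the two opcode bytes 0x48 0xb8).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read the 64-bit immediate stored at `offset` in `code`.
// For the function generated in this article, offset is 6.
uint64_t embedded_address(const uint8_t *code, std::size_t offset) {
    uint64_t address = 0;
    std::memcpy(&address, code + offset, sizeof address);
    return address;
}
```

On a little-endian machine, the bytes 0x40 0x9f 0x52 0x3 0x1 0x0 0x0 0x0 from the run above decode to 0x103529f40, while the earlier dump in this article decodes to 0x10df49f60. In your own program you could compare embedded_address(mp.mem, 6) with the address of test converted to an integer.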

You can find the complete source code for the above example on the GitHub repo for this article.

If you are interested in learning more about x86-64 Assembly, I would recommend reading Introduction to 64 bit Assembly Programming for Linux and OS X by Ray Seyfarth.

If you are interested in learning more about modern C++, A Tour of C++ by Bjarne Stroustrup is a decent introduction.

